Abstract

Penalized estimation principle is fundamental to high-dimensional problems. In the literature, it
has been extensively and successfully applied to various models with only structural parameters. In
contrast, in this paper, we first apply this penalization principle to a linear regression model with
both finite-dimensional structural parameters and high-dimensional sparse incidental parameters. For
the estimated structural parameters, we derive their consistency and asymptotic distributions, which
reveal an oracle property. However, the penalized estimator for the incidental parameters possesses
only partial selection consistency but not consistency. This is an interesting partial consistency
phenomenon: the structural parameters are consistently estimated while the incidental parameters
cannot be. For the structural parameters, also considered is an alternative two-step penalized
estimator, which improves the efficiency of the previous one-step procedure for challenging situations
and is more suitable for constructing confidence regions. Further, we extend the methods and results
to the case where the dimension of the structural parameters diverges with but slower than the sample
size. Data-driven penalty regularization parameters are provided. The finite-sample performance of
estimators for the structural parameters is evaluated by simulations and a real data set is analyzed.
Supplemental materials are available online.

1 Introduction



Since the pioneering papers by Tibshirani (1996) and Fan and Li (2001), the penalized estimation
methodology for exploiting sparsity has been studied extensively. For example, Zhao and Yu
(2006) provide an almost necessary and sufficient condition, namely the Irrepresentable Condition, for
the LASSO estimator to be strong sign consistent. Fan and Lv (2011) show that an oracle property
holds for the folded concave penalized estimator with ultrahigh dimensionality. For an overview of
this topic, see Fan and Lv (2010).

All the aforementioned papers consider the estimation of a structural parameter $\nu$ in the sense
that each data point depends on the same entries of $\nu$. In contrast, in this paper, we consider
another type of model where each data point depends on a different set of entries of $\nu$. Specifically,
data $\{(X_i, Y_i)\}_{i=1}^n$ follow the linear model:

$$Y_i = \mu_i^* + X_i^T\beta^* + \epsilon_i, \qquad (1.1)$$

where the incidental parameter $\mu^* = (\mu_1^*, \dots, \mu_n^*)^T$ is sparse, the structural parameter
$\beta^* = (\beta_1^*, \dots, \beta_d^*)^T$ is of main interest, $\{X_i\}$ are $d$-dimensional observable
covariates, and $\{\epsilon_i\}$ are random errors. Let $\nu = (\mu^{*T}, \beta^{*T})^T$. Then, in model (1.1), a
different data point $(X_i, Y_i)$ depends on a different subset of $\nu$, that is, $\mu_i^*$ and $\beta^*$.

Model (1.1) arises from Fan et al. (2012b), which considers a large-scale multiple testing problem
under arbitrary dependence of test statistics. By Principal Factor Approximation, the dependent
test statistics $Z = (Z_1, \dots, Z_p)^T \sim N(\mu, \Sigma)$ can always be decomposed as

$$Z_i = \mu_i + b_i^T W + K_i,$$

where $b_i$ is the $i$th row of the first $k$ unstandardized principal components, denoted by $B$, of $\Sigma$
and $K = (K_1, \dots, K_p)^T \sim N(0, A)$ with $A = \Sigma - BB^T$. The common factor $W$ drives the
dependence among the test statistics. This realized but unobserved factor is critical for False
Discovery Proportion (FDP) estimation and power improvements by removing the common factor
$\{b_i^T W\}$ from the test statistics. Hence, the goal is to estimate $W$ with given $\{b_i\}_{i=1}^p$. In the
multiple testing problem, the parameters $\{\mu_i\}_{i=1}^p$ are sparse. The choice of $k$ is to make $A$ weakly
dependent. Replacing $Z_i$, $\mu_i$, $b_i$, $W$, $k$, $p$, and $K_i$ with $Y_i$, $\mu_i^*$, $X_i$, $\beta^*$, $d$, $n$, and
$\epsilon_i$ respectively, we obtain model (1.1).

Although model (1.1) emerges from a critical component of estimating FDP in Fan et al.
(2012b), it is of interest in its own right. For example, in some applications, there are only a few
signals (nonzero $\mu_i^*$'s) and we are interested in learning about $\beta^*$, which reflects the relationship
between the covariates and response. For another instance, those few nonzero $\mu_i^*$'s might be some
measurement or recording errors of the responses $\{Y_i\}$. In this case, model (1.1) is suitable for
modeling data with contaminated responses, and a method producing a reliable estimator for $\beta^*$ is
essentially a robust replacement for ordinary least squares, which is sensitive to outliers.

Several models with such a mixed parameter structure were first studied in a seminal paper
by Neyman and Scott (1948), which points out the inconsistency of the classic maximum likelihood
estimator (MLE) in the presence of a large number of incidental parameters and provides a modified
MLE. However, their method does not work for our problem because it does not exploit sparsity.
Kiefer and Wolfowitz (1956) show the consistency of the MLE when high-dimensional incidental
parameters are assumed to come from a common distribution. Basu (1977) considers the elimination
of nuisance parameters via marginalizing and conditioning methods and Moreira (2009) solves the
incidental parameter problem with an invariance principle. For a review of the incidental parameter
problems in statistics and economics, see Lancaster (2000).

Without loss of generality, suppose the first $s$ incidental parameters $\{\mu_i^*\}_{i=1}^s$ are nonvanishing
and the remaining are zero. Then, model (1.1) can be written in a matrix form as

$$Y = \mathbb{X}\nu + \epsilon,$$

where

$$\mathbb{X} = \begin{pmatrix} X_{1,s} & I_s & 0 \\ X_{s+1,n} & 0 & I_{n-s} \end{pmatrix},$$

$X_{i,j} = (X_i, X_{i+1}, \dots, X_j)^T$, $I_k$ is a $k \times k$ identity matrix, $0$ is a generic block of zeros and
$\nu = (\beta^{*T}, \mu_1^*, \dots, \mu_s^*, \mu_{s+1}^*, \dots, \mu_n^*)^T$. While this is a sparse high-dimensional problem, the
matrix $\mathbb{X}$ does not satisfy the sufficient conditions in Zhao and Yu (2006) and Fan and Lv (2011) due
to inconsistency of incidental parameters in $\nu$. For details, see Supplement C.

In this paper, we investigate mainly a penalized estimator of $\beta^*$ defined through

$$(\hat\mu, \hat\beta) = \arg\min_{\mu,\beta} \sum_{i=1}^n (Y_i - \mu_i - X_i^T\beta)^2 + \sum_{i=1}^n p_\lambda(|\mu_i|), \qquad (1.2)$$

where $p_\lambda$ is a penalty function with a regularization parameter $\lambda$. Since only the incidental
parameters are sparse, the penalty is imposed on them. It will be shown that $\hat\beta$ possesses consistency
and an oracle property. On the other hand, nonvanishing elements of $\mu^*$ cannot be consistently
estimated even if $\beta^*$ were known. So, there is a partial consistency phenomenon.

Penalized method (1.2) is a one-step procedure. Alternatively, a two-step method first employs
the penalized estimation to identify a subset of data with vanishing incidental parameters and
second estimates $\beta^*$ with the subset. We will show that the estimator $\tilde\beta$ from the two-step method
is asymptotically equivalent to the one-step estimator $\hat\beta$ for a main situation. Furthermore, $\tilde\beta$ has
fewer possible asymptotic distributions than $\hat\beta$ and thus is more suitable for constructing confidence
regions for $\beta^*$. Indeed, the two-step method improves the convergence rate and efficiency over the
one-step procedure for challenging situations where the impact of nonzero incidental parameters is
not negligible for the one-step estimation.

The rest of the paper is organized as follows. In Section 2, the model and penalized estimation
method are rigorously defined and the corresponding penalized estimator is characterized. Asymptotic
properties of the penalized estimator are derived in Section 3. Then a penalized two-step
estimator is proposed and its theoretical properties are obtained. We also explicitly characterize
two important quantities which are crucial for selecting the regularization parameter and for the
boundary conditions of the theoretical results, and provide a data-driven regularization parameter.
In Section 4, all the previous main theoretical results are extended to the case where the number of
covariates grows with but slower than the sample size. We present in Section 5 simulation results and
analyze a real data set. Section 6 concludes this paper and all the proofs are relegated to the appendix
and supplements.

2 Model and Method 

The matrix form of model (1.1) is given by

$$Y = \mu^* + X\beta^* + \epsilon, \qquad (2.1)$$

where $Y = (Y_1, Y_2, \dots, Y_n)^T$, $X = (X_1, X_2, \dots, X_n)^T$, and $\epsilon = (\epsilon_1, \epsilon_2, \dots, \epsilon_n)^T$. The covariates
$\{X_i\}_{i=1}^n$ are assumed to be i.i.d. copies of $X_0$ with mean zero and a positive definite covariance
matrix $\Sigma_X$, independent of the random errors $\{\epsilon_i\}$, which are i.i.d. copies of $\epsilon_0$ with mean zero
and variance $\sigma^2$. Suppose further there exist positive sequences $\kappa_n, \gamma_n \ll \sqrt{n}$ such that

$$P\Big(\max_{1 \le i \le n} \|X_i\|_2 > \kappa_n\Big) \to 0 \quad\text{and}\quad P\Big(\max_{1 \le i \le n} |\epsilon_i| > \gamma_n\Big) \to 0, \quad\text{as } n \to \infty, \qquad (2.2)$$

where $\ll$ means orderly less than and $\|\cdot\|_2$ stands for the Euclidean norm.

Assume there are three kinds of incidental parameters in model (2.1). The first $s_1$ incidental
parameters $\{\mu_i^*\}_{i=1}^{s_1}$ are large in the sense that $|\mu_i^*| \gg \max\{\kappa_n, \gamma_n\}$ for $1 \le i \le s_1$. The next $s_2$
ones $\{\mu_i^*\}_{i=s_1+1}^{s}$ are nonzero and bounded by $\gamma_n$, with $s = s_1 + s_2$. The last $n - s$ ones
$\{\mu_i^*\}_{i=s+1}^{n}$ are zero. It is unknown to us, however, which $\mu_i^*$'s are zero, bounded and large. Without
loss of generality, the sparsity of $\mu^*$ is understood by $n \gg s_1, s_2 \to \infty$. Denote these three types of
incidental parameters by vectors $\boldsymbol{\mu}_1^*$, $\boldsymbol{\mu}_2^*$, and $\boldsymbol{\mu}_3^*$, respectively.

Penalized least-squares (1.2) can now be written as

$$(\hat\mu, \hat\beta) = \arg\min_{\mu,\beta} L(\mu, \beta), \quad L(\mu, \beta) = \|Y - \mu - X\beta\|_2^2 + \sum_{i=1}^n p_\lambda(|\mu_i|). \qquad (2.3)$$

The penalty function $p_\lambda$ can be the soft ($L_1$ or LASSO), hard, SCAD or a general folded concave
penalty function (Fan and Li, 2001). When the penalty function is $L_1$, the loss function is a convex
function and thus local minimizers are global. To simplify the discussion on the globalness of the
minimizer, we next consider only the soft penalty function, that is, $p_\lambda(|\mu_i|) = 2\lambda|\mu_i|$. However,
with the hard or SCAD penalty function, similar theoretical results can be derived and the only
difference is that the penalized estimator is interpreted as a local minimizer.

By subdifferential calculus (see, for example, Theorem 3.27 in Jahn (2007)), we obtain the following
characterization of the penalized estimator.

Lemma 2.1. A necessary and sufficient condition for $(\hat\mu, \hat\beta)$ to be a minimizer of $L(\mu, \beta)$ is that

$$Y_i - \hat\mu_i - X_i^T\hat\beta = \lambda\,\mathrm{sgn}(\hat\mu_i), \quad\text{for } i \in \hat I_0^c,$$
$$|Y_i - X_i^T\hat\beta| \le \lambda, \quad\text{for } i \in \hat I_0,$$

where $\mathrm{sgn}(\cdot)$ is a sign function and $\hat I_0 = \{1 \le i \le n : \hat\mu_i = 0\}$.

The special structure of $L(\mu, \beta)$ strongly suggests a marginal descent algorithm to search for the
minimizer in (2.3), which computes iteratively

$$\mu^{(k)} = \arg\min_{\mu} L(\mu, \beta^{(k-1)}) \quad\text{and}\quad \beta^{(k)} = \arg\min_{\beta} L(\mu^{(k)}, \beta)$$

until convergence. The advantage of this algorithm is that there exist analytic solutions of the above
two minimization problems. They are, respectively, the soft-thresholding of the data $\{Y_i - X_i^T\beta^{(k-1)}\}$
to obtain $\mu^{(k)}$ and the ordinary least-squares estimator $\beta^{(k)}$ with updated responses $Y - \mu^{(k)}$.
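The two analytic updates described above can be sketched in a few lines of NumPy. This is only an illustrative sketch for the soft penalty, not the authors' implementation: the function names `soft_threshold` and `alternating_fit`, the tolerance, the iteration cap and the toy data are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(r, lam):
    """Componentwise soft-thresholding: (|r| - lam)_+ * sgn(r)."""
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def alternating_fit(X, Y, lam, beta0=None, tol=1e-10, max_iter=500):
    """Alternate the mu-step (soft-thresholding) and the beta-step (OLS)."""
    beta = np.zeros(X.shape[1]) if beta0 is None else np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        mu = soft_threshold(Y - X @ beta, lam)                # mu-step
        beta_new = np.linalg.lstsq(X, Y - mu, rcond=None)[0]  # beta-step
        converged = np.linalg.norm(beta_new - beta) < tol     # successive difference
        beta = beta_new
        if converged:
            break
    return soft_threshold(Y - X @ beta, lam), beta

# Hypothetical toy data: 10 large incidental parameters out of n = 200.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_star = np.array([1.0, -2.0, 0.5])
mu_star = np.zeros(n)
mu_star[:10] = 20.0
Y = mu_star + X @ beta_star + rng.normal(size=n)

mu_hat, beta_hat = alternating_fit(X, Y, lam=4.0)
```

At convergence the returned pair is, up to the tolerance, a fixed point of both updates, which is the property the stopping rule based on successive differences is meant to detect.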

In this and the next sections, we assume $d$ is a fixed integer. This simplifies the theoretical derivation
while keeping the main messages of this paper. For simplicity of statement, abbreviate "with probability
going to one" to "wpg1". A usual stopping rule for the above algorithm is based on the successive
difference $\|\beta^{(k+1)} - \beta^{(k)}\|_2$. By this rule, wpg1, the iterative algorithm stops at the second
iteration, so long as the initial estimator is bounded wpg1 (e.g., $\beta^{(0)} = 0$).

Proposition 2.2. Suppose there exist positive constants $C_1$ and $C_2$ such that $\|\beta^*\|_2 < C_1$ and
$\|\beta^{(0)}\|_2 < C_2$ wpg1. If the regularization parameter $\lambda$ satisfies (3.1), $s_1\lambda/n = O(1)$ and
$s_2\gamma_n/n = O(1)$, then, for every $K \ge 1$ and $k \le K$, with a probability $p_{n,K}$ increasing to one as $n \to \infty$,

$$\|\beta^{(k+1)} - \beta^{(k)}\|_2 \le O((s_1/n)^k), \quad\text{and}\quad \|\beta^{(k)}\|_2 \le 2\sqrt{d}\,C_1 + C_2.$$

Remark 1. For any given prespecified critical value in the stopping rule, Proposition 2.2 implies
that the algorithm stops at the second iteration wpg1. In practice, the sample size $n$ might not be
large enough for the two-iteration estimator to have a decent performance. By Proposition 2.2, $K$
iterations will make the distance $\|\beta^{(K+1)} - \beta^{(K)}\|_2$ of the small order $(s_1/n)^K$. The fast convergence
of the algorithm has been verified in simulations.

Suppose $\beta^{(k)}$ has a theoretical limit $\beta^{(\infty)}$, corresponding to which there is a limit estimator
$\mu^{(\infty)}$ for $\mu^*$. Then, $(\mu^{(\infty)}, \beta^{(\infty)})$ is a solution of the following system of nonlinear equations

$$\beta = (X^TX)^{-1}X^T(Y - \mu), \qquad (2.4)$$

and, with a soft-threshold estimator applied to each component,

$$\mu = (|Y - X\beta| - \lambda)_+\,\mathrm{sgn}(Y - X\beta). \qquad (2.5)$$

Lemma 2.3. A necessary and sufficient condition for $(\hat\mu, \hat\beta)$ to be a minimizer of $L(\mu, \beta)$ is that
it is a solution to equations (2.4) and (2.5).

By Lemma 2.3, $(\mu^{(\infty)}, \beta^{(\infty)})$ must be a minimizer of $L(\mu, \beta)$. Thus, without causing conceptual
confusion, the limit estimator $(\mu^{(\infty)}, \beta^{(\infty)})$ is still denoted as $(\hat\mu, \hat\beta)$.

Note that $\hat\beta$ is also a minimizer of the profiled loss function $L(\beta) = L(\mu(\beta), \beta)$, where $\mu(\beta)$ is
given by (2.5) with dependence on $\beta$ being stressed. Interestingly, this profiled loss function is a
criterion function equipped with the famous Huber loss function (Huber (1964) and Huber (1973)).
Specifically, the profiled loss function is

$$L(\beta) = \sum_{i=1}^n \rho(Y_i - X_i^T\beta),$$

where $\rho(x) = x^2 I(|x| \le \lambda) + (2\lambda|x| - \lambda^2) I(|x| > \lambda)$ is exactly the Huber loss function. The equivalence
between the penalized estimator and Huber's estimator indicates that the penalization principle
is versatile and naturally induces an important loss function in robust statistics. This equivalence
gives a formal endorsement of the least absolute deviation (LAD) robust regression in Fan et al.
(2012b) and indicates that they could use all data points with LAD regression rather than 90%
of them. It is worthwhile to note that although the penalized estimator is exactly Huber's
estimator for $\beta^*$, model (2.1) contains the additional sparse incidental parameter $\mu^*$, compared
with the linear regression model considered in Huber (1973). Recently, a few papers have appeared
on robust regression in high-dimensional settings; see, for example, Chen et al. (2010),
Lambert-Lacroix and Zwald (2011), Fan et al. (2012a) and Bean et al. (2012). Portnoy and He (2000)
provide a high-level review of the literature on robust statistics. Our model is different, however, as
we do not impose a randomness assumption on the "source of outliers" $\{\mu_i^*\}$.
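The equivalence between the profiled penalized loss and the Huber loss can be checked numerically for a single residual. The sketch below, with hypothetical helper names `profiled_loss` and `huber_rho`, plugs the soft-thresholded value of $\mu$ into $(r - \mu)^2 + 2\lambda|\mu|$ and compares the result with $\rho$ on a grid of residuals; it is an illustration of the identity, not part of the paper's proofs.

```python
import numpy as np

def soft_threshold(r, lam):
    """Componentwise soft-thresholding: (|r| - lam)_+ * sgn(r)."""
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def profiled_loss(r, lam):
    """(r - mu)^2 + 2*lam*|mu| evaluated at mu = soft-thresholded r."""
    mu = soft_threshold(r, lam)
    return (r - mu) ** 2 + 2.0 * lam * np.abs(mu)

def huber_rho(r, lam):
    """rho(x) = x^2 if |x| <= lam, and 2*lam*|x| - lam^2 otherwise."""
    a = np.abs(r)
    return np.where(a <= lam, r ** 2, 2.0 * lam * a - lam ** 2)

r = np.linspace(-6.0, 6.0, 241)   # grid of residuals Y_i - X_i^T beta
lam = 1.5
gap = np.max(np.abs(profiled_loss(r, lam) - huber_rho(r, lam)))
```

For $|r| > \lambda$ the two terms of the profiled loss are $\lambda^2$ and $2\lambda(|r| - \lambda)$, which sum to $2\lambda|r| - \lambda^2$, so `gap` is zero up to floating-point rounding.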
From the equations (2.4) and (2.5), $\hat\beta$ is a solution to

$$\varphi_n(\beta) = 0, \quad\text{where}\quad \varphi_n(\beta) = \beta - (X^TX)^{-1}X^T(Y - \mu(\beta)). \qquad (2.6)$$

In general, this is a Z-estimation problem. In the following theoretical analysis, we take this
characterization of $\hat\beta$. After obtaining $\hat\beta$, we take $\hat\mu = \mu(\hat\beta)$ as an estimator of $\mu^*$.

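At the end of this section, we provide some notations and an expansion of $\varphi_n(\beta)$. Let
$\mathbb{S} = \sum_{i=1}^n X_iX_i^T$, $\mathbb{S}_S = \sum_{i \in S} X_iX_i^T$, $\mathbb{S}_S^\mu = \sum_{i \in S} X_i\mu_i^*$, $\mathbb{S}_S^\epsilon = \sum_{i \in S} X_i\epsilon_i$,
$\mathcal{S} = \sum_{i=1}^n X_i$ and $\mathcal{S}_S = \sum_{i \in S} X_i$, where $S$ is a subset of $\{1, 2, \dots, n\}$. It is straightforward to show

$$\mathbb{S}\varphi_n(\beta) = (\mathbb{S}_{S_{10}} + \mathbb{S}_{S_{11}} + \mathbb{S}_{S_{12}})(\beta - \beta^*) - (\mathbb{S}^\mu_{S_{11}} + \mathbb{S}^\mu_{S_{12}})$$
$$\qquad - (\mathbb{S}^\epsilon_{S_{10}} + \mathbb{S}^\epsilon_{S_{11}} + \mathbb{S}^\epsilon_{S_{12}}) - \lambda(\mathcal{S}_{S_{20}} + \mathcal{S}_{S_{21}} + \mathcal{S}_{S_{22}} - \mathcal{S}_{S_{30}} - \mathcal{S}_{S_{31}} - \mathcal{S}_{S_{32}}), \qquad (2.7)$$

where the index sets $S_{10} = \{s+1 \le i \le n : |X_i^T(\beta^* - \beta) + \epsilon_i| \le \lambda\}$,
$S_{11} = \{1 \le i \le s_1 : |\mu_i^* + X_i^T(\beta^* - \beta) + \epsilon_i| \le \lambda\}$ and
$S_{12} = \{s_1+1 \le i \le s : |\mu_i^* + X_i^T(\beta^* - \beta) + \epsilon_i| \le \lambda\}$; $S_{20}$, $S_{21}$
and $S_{22}$ are defined similarly except that the absolute operation is omitted and "$\le$" is replaced by
"$>$"; $S_{30}$, $S_{31}$ and $S_{32}$ are defined similarly to $S_{20}$, $S_{21}$ and $S_{22}$ except that "$> \lambda$" is replaced
by "$< -\lambda$". Note that all these index sets depend on $\beta$.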

3 Asymptotic Properties 

It is critical to properly specify the regularization parameter $\lambda$. For the case where the number of
covariates $d$ is a fixed integer, it is specified as follows:

$$\kappa_n \ll \lambda, \quad \alpha\gamma_n \le \lambda, \quad\text{and}\quad \lambda \ll \min\{\mu^*, \sqrt{n}\}, \qquad (3.1)$$

where $\kappa_n$ and $\gamma_n$ are defined in (2.2), $\alpha$ is a constant greater than 2, and, with $\mu^*$ here denoting a
scalar, $\mu^* = \min_{1 \le i \le s_1} |\mu_i^*|$.

This specification of $\lambda$, together with the condition (2.2) on $\kappa_n$ and $\gamma_n$, sufficiently distinguishes
the large incidental parameters from others, and thus greatly simplifies the asymptotic properties
of the index sets $S_{ij}$'s in (2.7) in the sense that, wpg1, the index sets become independent of $\beta$.
Denote a hypercube of $\beta^*$ by $B_C(\beta^*) = \{\beta \in \mathbb{R}^d : |\beta_j - \beta_j^*| \le C, 1 \le j \le d\}$ with a constant $C > 0$.

Lemma 3.1 (On Index Sets $S_{ij}$'s). For every $C > 0$ and every $\beta \in B_C(\beta^*)$, wpg1,

$$S_{10} = S_{10}^*,\ S_{11} = \emptyset,\ S_{12} = S_{12}^*;\quad S_{20} = \emptyset,\ S_{21} = S_{21}^*,\ S_{22} = \emptyset;\quad S_{30} = \emptyset,\ S_{31} = S_{31}^*,\ S_{32} = \emptyset,$$

where the limit index sets $S_{10}^* = \{s+1, s+2, \dots, n\}$, $S_{12}^* = \{s_1+1, s_1+2, \dots, s\}$,
$S_{21}^* = \{1 \le i \le s_1 : \mu_i^* > 0\}$ and $S_{31}^* = \{1 \le i \le s_1 : \mu_i^* < 0\}$.

By Lemma 3.1, wpg1, the solution $\hat\beta$ to (2.6) has an analytic expression:

$$\hat\beta = \beta^* + (\mathbb{S}_{S_{10}^*} + \mathbb{S}_{S_{12}^*})^{-1}\big[\mathbb{S}^\mu_{S_{12}^*} + \mathbb{S}^\epsilon_{S_{10}^*} + \mathbb{S}^\epsilon_{S_{12}^*} + \lambda(\mathcal{S}_{S_{21}^*} - \mathcal{S}_{S_{31}^*})\big], \qquad (3.2)$$

from which we derive asymptotic properties of $\hat\beta$.

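The theoretical results in this section are all stated under the specification of regularization
parameter (3.1). In addition, some results need the following assumption.

(A) There exists some constant $\delta > 0$ such that $E\|X_0\|_2^{2+\delta} < \infty$ and

$$\|\boldsymbol{\mu}_2^*\|_2 / \|\boldsymbol{\mu}_2^*\|_{2+\delta} \to \infty, \quad\text{where}\quad \|\boldsymbol{\mu}_2^*\|_{2+\delta} = \Big(\sum_{i=s_1+1}^{s} |\mu_i^*|^{2+\delta}\Big)^{1/(2+\delta)}.$$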

The first result is on the existence of a unique consistent estimator of $\beta^*$.

Theorem 3.2 (Existence and Consistency on $\hat\beta$). If either $s_2 = o(n/(\kappa_n\gamma_n))$ or assumption (A)
holds, then, for every fixed $C > 0$, wpg1, there exists a unique estimator $\hat\beta_n \in B_C(\beta^*)$ such that
$\varphi_n(\hat\beta_n) = 0$ and $\hat\beta_n \xrightarrow{P} \beta^*$.

Remark 2. In Theorem 3.2, there are two different sufficient conditions, both of which essentially
put constraints on the bounded incidental parameters $\boldsymbol{\mu}_2^*$. They come from different analyses of
the term $\mathbb{S}^\mu_{S_{12}^*}$ in (3.2). Neither of them implies the other. For details, see Supplement E.

Corollary 3.3. If $s_2 = O(n^{\alpha_2})$ for some $\alpha_2 \in (0, 1)$ and $\kappa_n\gamma_n \ll n^{1-\alpha_2}$, then the conclusion of
Theorem 3.2 holds.

Next, we consider the asymptotic distributions of the consistent estimator $\hat\beta_n$ obtained in
Theorem 3.2. Without loss of generality, we assume the sizes of index sets $S_{21}^* = \{1 \le i \le s_1 :
\mu_i^* > 0\}$ and $S_{31}^* = \{1 \le i \le s_1 : \mu_i^* < 0\}$ are asymptotically $as_1$ and $(1 - a)s_1$ with a constant
$a \in (0, 1)$. Similar to Theorem 3.2, there are two different sufficient conditions. For clarity, we
present the asymptotic distributions of $\hat\beta_n$ separately under each sufficient condition. Let $\sim$ stand
for asymptotic equivalence.

Theorem 3.4 (Asymptotic Distributions on $\hat\beta$). Under the condition $s_2 \ll \sqrt{n}/(\kappa_n\gamma_n)$:

(1) if $s_1 \ll n/\lambda^2$, then $\sqrt{n}(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, \sigma^2\Sigma_X^{-1})$; [main case]

(2) if $s_1 \sim bn/\lambda^2$, then $\sqrt{n}(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, (b + \sigma^2)\Sigma_X^{-1})$, for every constant $b \in \mathbb{R}^+$;

(3) if $s_1 \gg n/\lambda^2$, then $r_n(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, \Sigma_X^{-1})$, where $r_n = n/(\lambda\sqrt{s_1})$.

Remark 3. Note that the constant $a$ does not appear in the limit distributions of Theorem 3.4 due
to cancellation and that the sub-$\sqrt{n}$ consistency emerges in case (3) when $s_1$ is large, because the
impact of the large incidental parameters is too big for the one-step procedure to handle efficiently.
For the second case, as $b \to 0$, its condition and limit distribution become those of case (1). In the
other direction, as $b$ grows large, it approaches case (3).
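A heuristic reading of the three cases, sketched here from the analytic expression (3.2) and not a substitute for the proof: suppose the term involving the bounded incidental parameters is negligible (as the condition on $s_2$ is meant to ensure), replace the matrix being inverted by $n\Sigma_X$, and treat the remaining sums as independent mean-zero terms. These simplifications are assumptions made for the illustration.

```latex
\begin{align*}
\hat\beta_n - \beta^*
  &\approx (n\Sigma_X)^{-1}\Big[\sum_{i > s_1} X_i\epsilon_i
     + \lambda\Big(\sum_{i \in S_{21}^*} X_i - \sum_{i \in S_{31}^*} X_i\Big)\Big], \\
\mathrm{Var}\big[\sqrt{n}\,(\hat\beta_n - \beta^*)\big]
  &\approx \Big(\sigma^2 + \frac{\lambda^2 s_1}{n}\Big)\Sigma_X^{-1}.
\end{align*}
```

Under this approximation, $\lambda^2 s_1/n \to 0$ corresponds to case (1), $\lambda^2 s_1/n \to b$ to case (2), and $\lambda^2 s_1/n \to \infty$ to case (3), where rescaling by $r_n = n/(\lambda\sqrt{s_1})$ instead of $\sqrt{n}$ is needed. Since each $X_i$ has mean zero and covariance $\Sigma_X$ whatever sign multiplies it, the proportion $a$ of positive large incidental parameters drops out of the variance.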

The main case of Theorem 3.4 leads to a simple corollary.

Corollary 3.5. Suppose $\lambda \ll n^{\alpha_1/2}$ and $\kappa_n\gamma_n \ll n^{\alpha_2}$ for some $\alpha_1 \in (0, 1)$ and $\alpha_2 \in (0, 1/2)$. If
$s_1 \ll n^{1-\alpha_1}$ and $s_2 \ll n^{1/2-\alpha_2}$, then $\sqrt{n}(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, \sigma^2\Sigma_X^{-1})$.

It will be shown in Section 3.2 that the conditions on $\kappa_n\gamma_n$ in Corollaries 3.3 and 3.5 are usually
satisfied for typical covariates and random errors.

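Under moment assumption (A), there are more possible asymptotic distributions for $\hat\beta_n$.

Theorem 3.6 (Asymptotic Distributions on $\hat\beta$). Let $D_n = \|\boldsymbol{\mu}_2^*\|_2$. Under assumption (A), for all
constants $b, c \in \mathbb{R}^+$,

(1) when $s_1 \ll n/\lambda^2$ and $D_n^2/n = o(1)$, $\sqrt{n}(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, \sigma^2\Sigma_X^{-1})$; [main case]

(2) when $s_1 \ll n/\lambda^2$ and $D_n^2/n \sim c$, $\sqrt{n}(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, (c + \sigma^2)\Sigma_X^{-1})$;

(3) when $s_1 \ll n/\lambda^2$ and $D_n^2/n \to \infty$, $r_n(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, \Sigma_X^{-1})$, where $r_n \sim n/D_n \ll \sqrt{n}$;

(4) when $s_1 \sim bn/\lambda^2$ and $D_n^2/n = o(1)$, $\sqrt{n}(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, (b + \sigma^2)\Sigma_X^{-1})$;

(5) when $s_1 \sim bn/\lambda^2$ and $D_n^2/n \sim c$, $\sqrt{n}(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, (b + c + \sigma^2)\Sigma_X^{-1})$;

(6) when $s_1 \sim bn/\lambda^2$ and $D_n^2/n \to \infty$, $r_n(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, \Sigma_X^{-1})$, where $r_n \sim n/D_n \ll \sqrt{n}$;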
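(7) when $s_1 \gg n/\lambda^2$ and $D_n^2/n = o(1)$ or $D_n^2/n \sim c$, $r_n(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, \Sigma_X^{-1})$, where
$r_n \sim n/(\lambda\sqrt{s_1}) \ll \sqrt{n}$;

(8) when $s_1 \gg n/\lambda^2$ and $D_n^2/n \to \infty$, letting $r_n \sim \min\{n/(\lambda\sqrt{s_1}), n/D_n\}$,

(8a) if $n/(\lambda\sqrt{s_1}) \gg n/D_n$, then $r_n(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, \Sigma_X^{-1})$;

(8b) if $\sqrt{b}\,n/(\lambda\sqrt{s_1}) \sim n/D_n$, then $r_n(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, (1 + b)\Sigma_X^{-1})$;

(8c) if $n/(\lambda\sqrt{s_1}) \ll n/D_n$, then $r_n(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, \Sigma_X^{-1})$.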

Remark 4 (An Oracle Property). Suppose an oracle tells the true $\mu^*$. Then, with the adjusted
responses $Y - \mu^*$, we obtain by least-squares an oracle estimator of $\beta^*$, which is given by
$\hat\beta^{(O)} = (X^TX)^{-1}X^T(Y - \mu^*)$. The limiting distribution of $\sqrt{n}(\hat\beta^{(O)} - \beta^*)$ is $N(0, \sigma^2\Sigma_X^{-1})$. Comparing
this with the main cases in Theorems 3.4 and 3.6 and Corollary 3.5, it is clear that the penalized
estimator $\hat\beta_n$ enjoys an oracle property when conditions are met.

Although mainly interested in the estimation of $\beta^*$, we also obtain the soft-thresholding
estimator $\hat\mu$ of $\mu^*$: for each $i$,

$$\hat\mu_i = \mu_i(\hat\beta) = (|Y_i - X_i^T\hat\beta| - \lambda)_+\,\mathrm{sgn}(Y_i - X_i^T\hat\beta). \qquad (3.3)$$

Denote $\mathcal{E} = \{\hat\mu_i \ne 0, \text{ for } i = 1, 2, \dots, s_1; \text{ and } \hat\mu_i = 0, \text{ for } i = s_1+1, s_1+2, \dots, n\}$.

Theorem 3.7 (Partial Selection Consistency on $\hat\mu$). If $\hat\beta \xrightarrow{P} \beta^*$, then $P(\mathcal{E}) \to 1$.

Remark 5. By Theorem 3.7, wpg1, the indexes of $\boldsymbol{\mu}_1^*$ and $\boldsymbol{\mu}_3^*$ are estimated correctly, but those of
$\boldsymbol{\mu}_2^*$ wrongly. When both the size and the magnitude of the elements of $\boldsymbol{\mu}_2^*$ are limited, incorrectly
estimating them as zero asymptotically has a negligible negative effect on the penalized estimator.

3.1 Two-step Estimation 

Theorems 3.4 and 3.6 show that $\hat\beta_n$ has multiple different limit distributions so that a wrong one
might be used when we construct Wald-type confidence regions for $\beta^*$. In addition, $\hat\beta_n$ is inefficient
or rate-suboptimal in the more challenging cases where the impact of large and bounded incidental
parameters is not ignorable. To address these two issues, we introduce a two-step method. After
applying the penalized estimation (2.3) and obtaining $\hat\mu$, let $\hat I_0 = \{1 \le i \le n : \hat\mu_i = 0\}$. Then, the
two-step estimator is given by

$$\tilde\beta = (X_{\hat I_0}^T X_{\hat I_0})^{-1} X_{\hat I_0}^T Y_{\hat I_0}, \qquad (3.4)$$

where $X_{\hat I_0}$ consists of $X_i$'s whose indexes are in $\hat I_0$ and $Y_{\hat I_0}$ consists of the corresponding $Y_i$'s.

Theorem 3.8 (Consistency and Asymptotic Normality on $\tilde\beta$). If either $s_2 = o(n/(\kappa_n\gamma_n))$ or
assumption (A) holds, then $\tilde\beta \xrightarrow{P} \beta^*$. If $s_2 = o(\sqrt{n}/(\kappa_n\gamma_n))$, then $\sqrt{n}(\tilde\beta - \beta^*) \xrightarrow{d} N(0, \sigma^2\Sigma_X^{-1})$.
On the other hand, under assumption (A),

(1) if $D_n^2/n = o(1)$, then $\sqrt{n}(\tilde\beta - \beta^*) \xrightarrow{d} N(0, \sigma^2\Sigma_X^{-1})$; [main case]

(2) if $D_n^2/n \sim c$, then $\sqrt{n}(\tilde\beta - \beta^*) \xrightarrow{d} N(0, (c + \sigma^2)\Sigma_X^{-1})$, for every constant $c \in \mathbb{R}^+$;

(3) if $D_n^2/n \to \infty$, then $r_n(\tilde\beta - \beta^*) \xrightarrow{d} N(0, \Sigma_X^{-1})$, where $r_n \sim n/D_n \ll \sqrt{n}$.

Compared with Theorems 3.4 and 3.6, the number of possible limit distributions of $\tilde\beta$ is reduced
to at least one third since the conditions on $s_1$ are not required because of the partial selection
consistency property from Theorem 3.7. Further, the two-step estimator improves the convergence
rate over the one-step estimator for those challenging cases.
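The two-step construction in (3.4) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the helper names `one_step` and `two_step`, the fixed number of iterations, the value of `lam` and the simulated data are all hypothetical choices for the example.

```python
import numpy as np

def soft_threshold(r, lam):
    """Componentwise soft-thresholding: (|r| - lam)_+ * sgn(r)."""
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def one_step(X, Y, lam, n_iter=200):
    """Penalized (one-step) estimator via alternating soft-thresholding and OLS."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = soft_threshold(Y - X @ beta, lam)
        beta = np.linalg.lstsq(X, Y - mu, rcond=None)[0]
    return soft_threshold(Y - X @ beta, lam), beta

def two_step(X, Y, lam):
    """Step 1: penalized fit. Step 2: OLS on the points with mu_hat_i = 0."""
    mu_hat, beta_hat = one_step(X, Y, lam)
    keep = mu_hat == 0                       # estimated index set I_0
    beta_tilde = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    return beta_tilde, keep, beta_hat

# Hypothetical toy data with 10 large incidental parameters.
rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_star = np.array([1.0, -2.0, 0.5])
mu_star = np.zeros(n)
mu_star[:10] = 20.0
Y = mu_star + X @ beta_star + rng.normal(size=n)

beta_tilde, keep, beta_hat = two_step(X, Y, lam=4.0)
```

The second step discards every observation whose estimated incidental parameter is nonzero, so the large incidental parameters no longer enter the least-squares fit through the threshold $\lambda$.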

By the main cases of Theorems 3.4, 3.6 and 3.8, we can construct Wald-type confidence regions
for $\beta^*$. For example, by Theorem 3.8, a confidence region with asymptotic confidence level $1 - \alpha$
is given by

$$\{\beta \in \mathbb{R}^d : \sigma^{-1}\sqrt{n}\,\|\Sigma_X^{1/2}(\tilde\beta - \beta)\|_2 \le q_\alpha(\chi_d)\}, \qquad (3.5)$$

where $q_\alpha(\chi_d)$ is the upper $\alpha$-quantile of $\chi_d$, the square root of the chi-squared distribution with
degrees of freedom $d$. For each component $\beta_j^*$ of $\beta^*$, an asymptotic $1 - \alpha$ confidence interval is
given by

$$[\tilde\beta_j \pm n^{-1/2}\sigma\,\Sigma_X^{-1/2}(j,j)\,z_{\alpha/2}], \qquad (3.6)$$

where $\Sigma_X^{-1/2}(j,j)$ is the square root of the $(j,j)$ entry of $\Sigma_X^{-1}$ and $z_{\alpha/2}$ is the upper $\alpha/2$-quantile
of $N(0, 1)$. The confidence region (3.5) and interval (3.6) involve unknown parameters $\Sigma_X$ and $\sigma$.
They can be estimated by $\hat\Sigma_X = (1/n)X^TX$ and

$$\hat\sigma = \#(\hat I_0)^{-1/2}\,\|Y_{\hat I_0} - X_{\hat I_0}\tilde\beta\|_2. \qquad (3.7)$$

By the law of large numbers, $\hat\Sigma_X$ is consistent. On the other hand, $\hat\sigma$ is also consistent.

Lemma 3.9 (Consistency on $\hat\sigma$). Suppose $s_2 = o(n/(\kappa_n\gamma_n))$ or assumption (A) holds. If
$s_2 = o(n/\gamma_n^2)$, then $\hat\sigma \xrightarrow{P} \sigma$.

Thus, after replacing $\Sigma_X$ and $\sigma$ in the confidence region (3.5) and interval (3.6) with $\hat\Sigma_X$ and
$\hat\sigma$, the resulting confidence region keeps the asymptotic confidence level $1 - \alpha$.
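The plug-in interval (3.6) with the estimators $\hat\Sigma_X$ and $\hat\sigma$ from (3.7) can be sketched as below. The function name `wald_intervals` and the toy data are hypothetical, and, purely to keep the example short, the demo takes the index set as the truly uncontaminated observations instead of running the penalized step that would produce $\hat I_0$.

```python
import numpy as np
from statistics import NormalDist

def wald_intervals(X, Y, beta_tilde, keep, alpha=0.05):
    """Componentwise intervals (3.6) with plug-in Sigma_hat and sigma_hat (3.7)."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n                                    # (1/n) X^T X
    resid = Y[keep] - X[keep] @ beta_tilde
    sigma_hat = np.linalg.norm(resid) / np.sqrt(keep.sum())    # (3.7)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)                # upper alpha/2 quantile
    half = z * sigma_hat * np.sqrt(np.diag(np.linalg.inv(Sigma_hat)) / n)
    return beta_tilde - half, beta_tilde + half, sigma_hat

# Hypothetical toy data; `keep` stands in for the estimated index set.
rng = np.random.default_rng(2)
n, d = 500, 2
X = rng.normal(size=(n, d))
beta_star = np.array([0.5, -1.0])
mu_star = np.zeros(n)
mu_star[:10] = 15.0
Y = mu_star + X @ beta_star + rng.normal(size=n)

keep = mu_star == 0
beta_tilde = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
lower, upper, sigma_hat = wald_intervals(X, Y, beta_tilde, keep)
```

Note that $\hat\sigma$ divides by the number of retained observations, exactly as in (3.7), without a degrees-of-freedom correction.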
3.2 Regularization Parameter

The regularization parameter $\lambda$ is determined by $\kappa_n$ and $\gamma_n$, which are also crucial to the boundary
conditions of the asymptotic properties of the penalized estimators $\hat\beta$ and $\tilde\beta$. By condition (2.2),
$\kappa_n$ and $\gamma_n$ depend on the distributions of $X_0$ and $\epsilon_0$, respectively. It is of interest to explicitly
derive $\kappa_n$ and $\gamma_n$ under some typical assumptions on the covariates and errors. Next, we consider
three typical cases: the first two are on Gaussian or bounded random variables and the last one is
on general random variables. A special case with exponentially tailed random variables is provided
in Supplement E.2.

First, consider the case where the covariates are bounded with $C_X > 0$ and the random errors
follow $N(0, \sigma^2)$. Let $\kappa_n = \sqrt{d}\,C_X$ and $\gamma_n = \sqrt{2\sigma^2\log(n)}$. They satisfy condition (2.2). Then the
specification of the regularization parameter (3.1) becomes $\alpha\sqrt{2\sigma^2\log(n)} \le \lambda \ll \min\{\mu^*, \sqrt{n}\}$.

Second, consider the case with both Gaussian covariates and Gaussian errors. That is, $X_0$ and
$\epsilon_0$ follow $N(0, \Sigma_X)$ and $N(0, \sigma^2)$, respectively. Denote by $\sigma_X^2$ the maximum of the diagonal elements
of $\Sigma_X$. We can take $\kappa_n = \sqrt{2d\sigma_X^2\log(n)}$. Then, the specification (3.1) becomes
$\sqrt{\log(n)} \ll \lambda \ll \min\{\mu^*, \sqrt{n}\}$.

Third, consider the case with general covariates and general errors. For convenience, we first
introduce the definition of Orlicz norm and related inequalities. For a strictly increasing and convex
function $\psi$ with $\psi(0) = 0$, the Orlicz norm of a random variable $Z$ with respect to $\psi$ is defined as

$$\|Z\|_\psi = \inf\{C > 0 : E\psi(|Z|/C) \le 1\}.$$

Then, for each $x > 0$,

$$P(|Z| > x) \le 1/\psi(x/\|Z\|_\psi). \qquad (3.8)$$

(See Page 96 of van der Vaart and Wellner (1996).) In addition, by Lemma 2.2.2 on Page 96 of
van der Vaart and Wellner (1996), for $\psi$ satisfying $\limsup_{x,y \to \infty} \psi(x)\psi(y)/\psi(cxy) < \infty$ with some
constant $c$,

$$\Big\|\max_{1 \le i \le n} Z_i\Big\|_\psi \le K\psi^{-1}(n)\max_{1 \le i \le n}\|Z_i\|_\psi,$$

where $K$ is a constant independent of the random variables and the sample size and $\psi^{-1}(\cdot)$ is the
inverse function of $\psi$. Combining the above two inequalities, it follows that

$$P\Big(\Big|\max_{1 \le i \le n} Z_i\Big| > x\Big) \le 1/\psi\Big(x/\Big(K\psi^{-1}(n)\max_{1 \le i \le n}\|Z_i\|_\psi\Big)\Big). \qquad (3.9)$$

By (3.9), a sufficient condition for (2.2) is that $\kappa_n$ and $\gamma_n$ satisfy

$$\kappa_n \gg \psi^{-1}(n) \quad\text{and}\quad \gamma_n \gg \psi^{-1}(n). \qquad (3.10)$$

For example, if $\psi_q(x) = e^{x^q} - 1$ with $q \ge 1$ and $\|\epsilon_0\|_{\psi_q} + \sum_{j=1}^d \|X_{0j}\|_{\psi_q} < \infty$, then, by (3.10), a
sufficient condition for (2.2) is $\min\{\kappa_n, \gamma_n\} \gg (\log n)^{1/q}$. Thus, the specification (3.1) becomes

$$(\log(n))^{1/q}\tau_n \le \lambda \ll \min\{\mu^*, \sqrt{n}\},$$

where $\tau_n$ is a sequence diverging to $\infty$ as slowly as possible.

Besides the theoretical specification for $\lambda$, a data-driven regularization parameter is helpful in
practice. A popular way is to use multi-fold cross-validation, but the validation set needs to be
as little contaminated as possible. We propose the following procedure; its performance
will be demonstrated in Subsection 5.2.

[Procedure for Data-driven Regularization Parameter]

1. Apply the OLS with all the data and obtain residuals $\hat\epsilon_i^{(OLS,1)} = Y_i - X_i^T\hat\beta^{(OLS,1)}$ for each $i$.

2. Identify the set of "pure" data corresponding to the $n_{pure}$ smallest values in $\{|\hat\epsilon_i^{(OLS,1)}|\}$.

3. Compute the updated OLS estimator $\hat\beta^{(OLS,2)}$ with the "pure" data and obtain updated
residuals $\{\hat\epsilon_i^{(OLS,2)}\}$.

4. Identify the updated "pure" data with the $n_{pure}$ smallest $\{|\hat\epsilon_i^{(OLS,2)}|\}$ and the remaining as
"contaminated" ones.

5. Randomly select a subset from the updated "pure" set as a testing set; the remaining
"pure" and "contaminated" sets are merged into a training set.

6. For each grid point of $\lambda$ in an interval $[\lambda_L, \lambda_U]$, apply a penalized method to the training set
and obtain the estimator $\hat\beta_{\lambda,train}$.

7. Identify the optimal grid point $\lambda_{opt}$, which minimizes $\hat\sigma^2_{\lambda,test} = \sum_{i \in \text{testing set}} (Y_i - X_i^T\hat\beta_{\lambda,train})^2$.

The interval $[\lambda_L, \lambda_U]$ in Step 6 can be specified as follows:

a. Obtain the $q$th quantile $q(\hat\epsilon)$ of $\{|\hat\epsilon_i^{(OLS,2)}|\}$, for a large $q$.

b. Compute the standard deviation $\hat\sigma_{pure}$ of residuals $\{\hat\epsilon_i^{(OLS,2)}\}_{pure}$.

c. Set $\lambda_L = \alpha_L\hat\sigma_{pure}$ and $\lambda_U = q(\hat\epsilon)$, where $\alpha_L$ is a positive constant such that $\lambda_L < \lambda_U$.
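Steps 1 to 7 and the interval rule a to c can be sketched as follows. This is an illustration under stated assumptions, not the procedure as implemented by the authors: the function names, the equally spaced grid, the defaults `alpha_L=1.0`, `q=0.95` and `n_grid=20`, the sizes `n_pure` and `n_test`, and the use of the soft-penalty alternating fit as "a penalized method" are all hypothetical choices.

```python
import numpy as np

def soft_threshold(r, lam):
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def ols(X, Y):
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def penalized_beta(X, Y, lam, n_iter=200):
    """Soft-penalty estimator of beta via alternating soft-thresholding and OLS."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = ols(X, Y - soft_threshold(Y - X @ beta, lam))
    return beta

def choose_lambda(X, Y, n_pure, n_test, alpha_L=1.0, q=0.95, n_grid=20, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: OLS on all data, keep the n_pure smallest absolute residuals.
    pure = np.argsort(np.abs(Y - X @ ols(X, Y)))[:n_pure]
    # Steps 3-4: refit on the "pure" data and update the "pure" set.
    res2 = Y - X @ ols(X[pure], Y[pure])
    pure = np.argsort(np.abs(res2))[:n_pure]
    # Step 5: testing set drawn from the "pure" set, everything else is training.
    test = rng.choice(pure, size=n_test, replace=False)
    train = np.setdiff1d(np.arange(len(Y)), test)
    # Rules a-c: end points of the search interval.
    lam_U = np.quantile(np.abs(res2), q)
    lam_L = alpha_L * np.std(res2[pure])
    if not lam_L < lam_U:
        raise ValueError("alpha_L must be chosen so that lam_L < lam_U")
    grid = np.linspace(lam_L, lam_U, n_grid)
    # Steps 6-7: fit on the training set, score on the testing set.
    errs = np.array([
        np.sum((Y[test] - X[test] @ penalized_beta(X[train], Y[train], lam)) ** 2)
        for lam in grid
    ])
    return grid[int(np.argmin(errs))], grid, errs

# Hypothetical toy data: 20 contaminated responses out of n = 200.
rng = np.random.default_rng(3)
n, d = 200, 3
X = rng.normal(size=(n, d))
mu_star = np.zeros(n)
mu_star[:20] = 10.0
Y = mu_star + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

lam_opt, lam_grid, errs = choose_lambda(X, Y, n_pure=150, n_test=30)
```

Drawing the testing set only from the "pure" observations is what keeps the validation error from being dominated by contaminated responses.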

4 Diverging number of structural parameters 

In Sections 2 and 3, we have considered model (2.1) under the assumption that the number of
covariates $d$ is a fixed integer. However, when there are a moderate or large number of covariates,
it is more appropriate to assume that $d$ diverges to infinity with the sample size. In this section,
we consider model (2.1) with the assumption that $d \to \infty$ and $d \ll n$.

Since the number of covariates grows orderly slower than the sample size, it is appropriate to
proceed to utilize the penalized estimation (2.3) for $(\mu^*, \beta^*)$ and the penalized two-step estimation
(3.4) for $\beta^*$. The corresponding estimators are still denoted as $(\hat\mu, \hat\beta)$ and $\tilde\beta$, but we should keep
in mind that their dimensions diverge to infinity with $n$. The characterizations of $\hat\beta$ in Lemmas
2.1 and 2.3 are still valid since they are finite-sample results. The iteration algorithm also wpg1
stops at the second iteration, which is supported by an extension of Proposition 2.2 provided in
Supplement F.

As before, it is critical to properly specify the regularization parameter $\lambda$. For the case with a
diverging number of covariates, it is specified as follows:

$$\sqrt{d}\,\kappa_n \ll \lambda, \quad \alpha\gamma_n \le \lambda, \quad\text{and}\quad \lambda \ll \mu^*, \qquad (4.1)$$

where $\kappa_n$ and $\gamma_n$ are defined in (2.2) and $\alpha > 2$. Comparing it with the previous specification (3.1),
the main difference in form is that $\kappa_n$ is changed to $\sqrt{d}\,\kappa_n$. In fact, $\kappa_n$ in (4.1) also depends
on $d$, which will be shown in Supplement E.2. This difference highlights the assumption that $d$
diverges to $\infty$. With (4.1), the conclusion of Lemma 3.1 on the index sets continues to hold.

Lemma 4.1 (On Index Sets $S_{ij}$'s). For model (2.1) with $d \to \infty$ and $d \ll n$, if the regularization
parameter $\lambda$ satisfies (4.1), then the conclusion of Lemma 3.1 holds.

Thus, wpg1, the crucial analytic expression (3.2) of $\hat\beta$ remains valid, from which we derive its
theoretical properties. These properties are essentially parallel to those of the previous case with a
fixed $d$, with additional technical complexity caused by the diverging dimension $d$.

Before stating theoretical results, we list some technical assumptions on the covariates. Denote
$\|\cdot\|_{F,d} = d^{-1/2}\|\cdot\|_F$, where $\|\cdot\|_F$ is the Frobenius norm, and the average of the square root of the
fourth marginal moments of $X_0$ as $\kappa_X = d^{-1}\sum_{j=1}^d (E[X_{0j}^4])^{1/2}$.

(B1) $\|\Sigma_X^{-1}\|_{F,d}$ is bounded.
(C) $\kappa_X$ is bounded.

Theorem 4.2 (Existence and Consistency on $\hat\beta$). Suppose assumptions (B1) and (C) hold. If
there exists $r_d$, a sequence of positive numbers depending on $d$, such that

d^/n^O, {rdd)^/n^O, si = o{n/{rdVdKnX)) and S2 = o{n/ {rdVdKnjn)), 
then, for every fixed $C > 0$, wpg1, there exists a unique estimator $\hat\beta_n \in B_C(\beta^*)$ such that

$$\varphi_n(\hat\beta_n) = 0 \quad\text{and}\quad r_d\|\hat\beta_n - \beta^*\|_2 \xrightarrow{P} 0.$$

Next, we consider the asymptotic distribution of $\hat\beta$. Since the dimension of $\hat\beta$ diverges to infinity,
following Fan and Lv (2011), it is more appropriate to study its linear maps. Let $A_n$ be a $q \times d$
matrix, where $q$ is a fixed integer, $G_n = A_nA_n^T$ with the largest eigenvalue $\lambda_{\max}(G_n)$, and
$G_{X,n} = A_n\Sigma_X^{-1}A_n^T$. Denote by $\lambda_{\min}(\Sigma_X)$ the smallest eigenvalue of $\Sigma_X$,
$\sigma^2_{X,\max} = \max_{1 \le j \le d} \mathrm{Var}[X_{0j}]$, $\sigma^2_{X,\min} = \min_{1 \le j \le d} \mathrm{Var}[X_{0j}]$ and
$\gamma_{X,\max} = \max_{1 \le j \le d} E|X_{0j}|^3$. Abbreviate "with respect to" by "wrt". We assume further

(B1') $\lambda_{\min}(\Sigma_X)$ is bounded away from zero, which implies assumption (B1).
(B2) is bounded. 

(D) ||A„||i? and Amax(Gn) are bounded and Gx,n converges to a g x g symmetric matrix Gx wrt 

(E) $\sigma_{X,\max}$ and $\gamma_{X,\max}$ are bounded from above and $\sigma_{X,\min}$ is bounded away from zero.

Similar to the main case of Theorem 3.4, a properly scaled $\hat\beta$ is asymptotically Gaussian.

Theorem 4.3 (Asymptotic Distribution on /3). Suppose assumptions (Bl'), (B2), (C) and (D) 
and (E) hold. If d^logd = o{n), si = o{'s/n / {X^fdKn)) and S2 = o{\/n / {^/dKn^yn)) , then 

$$\sqrt{n}\,A_n(\hat\beta_n - \beta^*) \xrightarrow{d} N(0, \sigma^2 G_X).$$

With $\hat\beta$, an estimator $\hat\mu$ follows via (3.3). Next is an extension of Theorem 3.7.

Theorem 4.4 (Partial Selection Consistency on $\hat\mu$). Suppose $\hat\beta$ is a consistent estimator of $\beta^*$ wrt
$r_d\|\cdot\|_2$. If $r_d \ge 1/\sqrt{d}$, then $P(\mathcal{E}) \to 1$.

From $\hat\mu$, we construct the penalized two-step estimator $\tilde\beta$ through (3.4). This two-step estimator
is consistent (see Supplement F) and its asymptotic distribution, as an extension of the main case
in Theorem 3.8, is given by

Theorem 4.5 (Asymptotic Distribution on $\tilde\beta$). Suppose the conditions of Theorem 4.3 hold except
the condition $s_1 = o(\sqrt{n}/(\lambda\sqrt{d}\,\kappa_n))$. Then

$$\sqrt{n}\,A_n(\tilde\beta - \beta^*) \xrightarrow{d} N(0, \sigma^2 G_X).$$
From Theorems 4.3 and 4.5, Wald-type asymptotic confidence regions of $\beta^*$ are available. For
example, a confidence region based on $\hat\beta$ with asymptotic confidence level $1 - \alpha$ is given by

$$\{\beta \in \mathbb{R}^d : \sigma^{-1}\sqrt{n}\,\|G_{X,n}^{-1/2}A_n(\hat\beta - \beta)\|_2 \le q_\alpha(\chi_q)\}. \qquad (4.2)$$

Since $G_{X,n}$ involves the unknown $\Sigma_X$, we estimate it by $\hat G_{X,n} = A_n\hat\Sigma_n^{-1}A_n^T$. On the other hand,
$\sigma$ is estimated by $\hat\sigma$ in (3.7) as before. After plugging $\hat G_{X,n}$ and $\hat\sigma$ into (4.2), we obtain

$$\{\beta \in \mathbb{R}^d : \hat\sigma^{-1}\sqrt{n}\,\|\hat G_{X,n}^{-1/2}A_n(\hat\beta - \beta)\|_2 \le q_\alpha(\chi_q)\}. \qquad (4.3)$$
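The region (4.3) can be read as an ellipsoid centered at the estimator. The short derivation below is only a sketch: it assumes that $q_\alpha(\chi_q)$ denotes the upper-$\alpha$ quantile of the chi distribution with $q$ degrees of freedom (so that its square is the corresponding chi-square quantile, written $q_\alpha(\chi_q^2)$ here) and that $\hat G_{X,n}^{-1/2}$ is the symmetric inverse square root.

```latex
\begin{align*}
\hat\sigma^{-1}\sqrt{n}\,\bigl\|\hat G_{X,n}^{-1/2}A_n(\hat\beta-\beta)\bigr\|_2 \le q_\alpha(\chi_q)
&\iff
n\,\bigl\|\hat G_{X,n}^{-1/2}A_n(\hat\beta-\beta)\bigr\|_2^2 \le \hat\sigma^2\, q_\alpha(\chi_q)^2 \\
&\iff
n\,(\hat\beta-\beta)^T A_n^T \hat G_{X,n}^{-1} A_n (\hat\beta-\beta) \le \hat\sigma^2\, q_\alpha(\chi_q^2).
\end{align*}
```

Under these assumptions, checking membership only requires $\hat G_{X,n}^{-1}$, not its square root.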

Similar to Lemma 3.9, the consistency of $\hat\sigma$ is assured.

Lemma 4.6 (Consistency on $\hat\sigma$). Suppose the assumptions and conditions of Theorem 4.2 hold
with $r_d \ge \sqrt{d}$. If $s_2 = o(n/\gamma_n^2)$, then $\hat\sigma \xrightarrow{P} \sigma$.

The following theorem, together with Lemma 4.6, guarantees the asymptotic validity of the
confidence region (4.3). However, a stronger requirement on $d$ is needed to handle $\hat G_{X,n}^{-1/2}$.

Theorem 4.7 (Asymptotic Distributions on /3 and /3 with Gx.n)- Under the conditions of Theorem 
4.3, ifd^(log{d))^ = o{n), then 

$$\sqrt{n}\,\hat G_{X,n}^{-1/2}A_n(\hat\beta - \beta^*) \xrightarrow{d} N(0, \sigma^2 I_q).$$

Similarly, under the conditions of Theorem 4-5, If d^{log{d))'^ = o{n), then 

$$\sqrt{n}\,\hat G_{X,n}^{-1/2}A_n(\tilde\beta - \beta^*) \xrightarrow{d} N(0, \sigma^2 I_q).$$

5 Numerical Evaluations and Real Data Analysis 

The finite-sample performance of the penalized estimators is first evaluated through simulations
and then a real data set is analyzed. The model for generating data $\{(X_{i1}, X_{i2}, Y_i)\}_{i=1}^n$ is given by

$$Y_i = \mu_i^* + X_{i1}\beta_1^* + X_{i2}\beta_2^* + \epsilon_i, \quad i = 1, \ldots, n.$$

The sparse incidental parameters $\{\mu_i^*\}$ are i.i.d. realizations of the following mechanism: $\mu_i^*$ takes
the value 0 with probability $p_0$, is generated from $W_1(c + W_2)$ with probability $p_1$, and from the uniform
distribution over $[-c, c]$ with probability $p_2$, where $W_1$ takes values $-1$ and $1$ with probabilities $1 - p_w$ and $p_w$, and
$W_2$ follows an exponential distribution with mean $\tau$.

In the simulations, unless otherwise specified, $\beta_1^* = \beta_2^* = 1$, $\{(X_{i1}, X_{i2})^T\}$ are i.i.d. $N(0, I_2)$,
independent of $\{\epsilon_i\}$, which are i.i.d. $N(0, 1)$, and $n = 200$; $p_0 = 0.8$, $p_1 = 0.1$, $p_2 = 0.1$, $c$ is 0.5, 1, 3 or 5, and
$p_w$ is 0.5 or 0.75; the number of repetitions is 1000.
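The data-generating mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' simulation code: the function names `draw_incidental` and `simulate` are hypothetical, the mean of $W_2$ is taken to be $\tau = 1$ (an assumption based on the figure titles), and only the Python standard library is used.

```python
import random

def draw_incidental(rng, c, p0=0.8, p1=0.1, p_w=0.75, tau=1.0):
    """Draw one incidental parameter from the three-part mixture."""
    u = rng.random()
    if u < p0:                       # sparse part: exactly zero
        return 0.0
    if u < p0 + p1:                  # large part: W1 * (c + W2)
        w1 = 1.0 if rng.random() < p_w else -1.0
        w2 = rng.expovariate(1.0 / tau)   # exponential with mean tau
        return w1 * (c + w2)
    return rng.uniform(-c, c)        # bounded part, probability 1 - p0 - p1

def simulate(n=200, c=3.0, beta=(1.0, 1.0), seed=0, **mixture):
    """Generate (mu, x, y) from Y_i = mu_i + X_i1*beta_1 + X_i2*beta_2 + eps_i."""
    rng = random.Random(seed)
    mu = [draw_incidental(rng, c, **mixture) for _ in range(n)]
    x = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    y = [m + beta[0] * a + beta[1] * b + rng.gauss(0.0, 1.0)
         for m, (a, b) in zip(mu, x)]
    return mu, x, y
```

Calling `simulate(n=200, c=3.0, p_w=0.75)` gives one data set in the spirit of the setting shown in Figure 1; roughly 80% of the returned `mu` values are exactly zero.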
Figure 1: Sequence plot of the realized incidental parameters $\{\mu_i^*\}$ with $c = 3$, $p_w = 0.75$ and $n = 200$.

5.1 Performance of Penalized Methods 

The following methods will be compared: the Oracle estimator (O), which knows the index set $S$ of
the zero $\mu_i^*$'s and serves as the benchmark that we can at best mimic; Ordinary Least Squares (OLS),
which regards $\mu^*$ as zero; and four penalized methods, namely, Penalized Least Squares with Hard
Penalty (PLS.Hard or H), Penalized Least Squares with Soft Penalty (PLS.Soft or S), Two-Step
Penalized Least Squares with Hard Penalty (PLS.Hard.TwoStep or H.TS) and Two-Step Penalized
Least Squares with Soft Penalty (PLS.Soft.TwoStep or S.TS). The oracle estimator is given by
$\hat\beta^{(O)} = (\sum_{i\in S} X_iX_i^T)^{-1}\sum_{i\in S} X_i(Y_i - \mu_i^*)$. The hard-thresholding method refers to the penalty
$p_\lambda(|t|) = \lambda^2 - (|t| - \lambda)^2 1\{|t| < \lambda\}$ in (2.3), whereas the soft-thresholding method uses the $L_1$ penalty.
Each method is evaluated by the square root of the mean squared error (RMSE). In this subsection,
each penalized method is evaluated over a range of values of the regularization parameter $\lambda$, and we
examine its performance with the best $\lambda$.
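To make the difference between the hard and soft rules concrete, here is a small sketch of a one-step penalized least squares fit by alternating minimization. It is an illustration under stated assumptions rather than the paper's algorithm: the function names are hypothetical, two covariates without intercept are assumed, the exact scaling of the objective (2.3) is not reproduced, and the updates simply alternate between thresholding the residuals to get the incidental parameters and refitting the structural parameter by least squares.

```python
def soft_threshold(r, lam):
    """Soft rule: shrink r toward zero by lam."""
    if r > lam:
        return r - lam
    if r < -lam:
        return r + lam
    return 0.0

def hard_threshold(r, lam):
    """Hard rule: keep r unchanged when |r| > lam, otherwise set it to zero."""
    return r if abs(r) > lam else 0.0

def least_squares_2d(xs, zs):
    """No-intercept least squares of zs on two covariates (2x2 normal equations)."""
    a11 = sum(x1 * x1 for x1, _ in xs)
    a12 = sum(x1 * x2 for x1, x2 in xs)
    a22 = sum(x2 * x2 for _, x2 in xs)
    b1 = sum(x1 * z for (x1, _), z in zip(xs, zs))
    b2 = sum(x2 * z for (_, x2), z in zip(xs, zs))
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def penalized_ls(xs, ys, lam, threshold, max_iter=200, tol=1e-10):
    """Alternate between thresholding residuals (mu) and refitting beta."""
    beta = least_squares_2d(xs, ys)          # start from OLS, i.e. mu = 0
    mu = [0.0] * len(ys)
    for _ in range(max_iter):
        resid = [y - beta[0] * x1 - beta[1] * x2 for (x1, x2), y in zip(xs, ys)]
        mu = [threshold(r, lam) for r in resid]
        new_beta = least_squares_2d(xs, [y - m for y, m in zip(ys, mu)])
        converged = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, mu
```

On noise-free data with a few large incidental parameters, the hard rule recovers the structural parameter exactly, while the soft rule keeps a shrinkage bias proportional to $\lambda$, which is consistent with the bias of soft thresholding for large $\lambda$ discussed in this subsection.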

Figure 1 shows the realized incidental parameters $\{\mu_i^*\}$ with $c = 3$ and $p_w = 0.75$. With these
incidental parameters, the RMSE of different estimators of $\beta_1^*$ and $\beta^*$ is shown in the left panel of
Figure 2. The RMSE for $\beta_2^*$ is similar to that for $\beta_1^*$ because of the symmetry and is thus not presented.
As expected, the oracle method has the smallest RMSE while OLS has the largest. The RMSE of
PLS.Hard as a function of $\lambda$ forms a convex curve that attains its minimum when $\lambda$ is
between 2 and 3. On the other hand, the RMSE of PLS.Soft decreases slightly until $\lambda$ is around 1 and
then increases. This reflects the fact that a large value of $\lambda$ in a soft-thresholding method causes
bias. The minimal RMSE of a penalized method measures its performance when $\lambda$ is optimally
chosen. The minimal RMSE of PLS.Soft is larger than that of PLS.Hard. PLS.Hard.TwoStep
performs very similarly to PLS.Hard for all $\lambda$. However, PLS.Soft.TwoStep comes closer to
PLS.Hard than PLS.Soft. This is because PLS.Hard and PLS.Soft give similar estimates of the
large incidental parameters when $\lambda$ is large. Overall, the four penalized methods perform similarly,
and their performance is close to that of the oracle estimator and significantly better than that of OLS. Table
1 reports the (minimal) RMSE of $\hat\beta$ and the corresponding bias and RMSE of $\hat\beta_1$ and shows that
the bias contributes little to RMSE.

Figure 2: The left panel shows RMSE for the estimators of $\beta_1^*$ and $\beta^*$ with the incidental parameters
shown in Figure 1. A horizontal line indicates the (minimal) RMSE for a method. The minimal RMSE for
each penalized method is the RMSE when $\lambda$ is best chosen. The right panel shows in addition RMSE of
PLS.Hard.Prac (H.P) and PLS.Soft.Prac (S.P) with data-driven regularization parameters. Both panels are
titled "Square Roots of MSE with n = 200, beta1 = 1, beta2 = 1, p0 = 0.8, p1 = 0.1, c = 3, tao = 1, w.p = 0.75".

                       O        OLS     H       S       H.TS    S.TS    H.P     S.P
$\hat\beta_1$: bias    -.00089  -.0063  -.0026  .0050   .0011   -.0022  -.0038  .0020
$\hat\beta_1$: RMSE    .080     .148    .088    .093    .087    .089    .111    .113
$\hat\beta$: RMSE      .111     .210    .126    .133    .123    .126    .155    .156

Table 1: The (minimal) RMSE of $\hat\beta$ and the corresponding bias and RMSE of $\hat\beta_1$ with the incidental
parameters shown in Figure 1.

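The quantities reported in Table 1 can be computed from the estimates collected over the simulation repetitions. The helper below is a hedged sketch: the name `bias_and_rmse` is hypothetical, and it assumes that the RMSE of the vector estimator is the square root of the average squared Euclidean error, which the text does not state explicitly.

```python
import math

def bias_and_rmse(estimates, truth):
    """Summarize repeated estimates of a parameter vector.

    estimates: list of tuples, one estimated vector per repetition.
    truth: tuple holding the true parameter vector.
    Returns (per-coordinate bias, per-coordinate RMSE, RMSE of the whole vector).
    """
    reps = len(estimates)
    d = len(truth)
    bias = [sum(e[j] for e in estimates) / reps - truth[j] for j in range(d)]
    rmse_coord = [
        math.sqrt(sum((e[j] - truth[j]) ** 2 for e in estimates) / reps)
        for j in range(d)
    ]
    # Assumed definition: root of the mean squared Euclidean distance.
    rmse_vec = math.sqrt(
        sum(sum((e[j] - truth[j]) ** 2 for j in range(d)) for e in estimates) / reps
    )
    return bias, rmse_coord, rmse_vec
```

Under this definition the squared vector RMSE equals the sum of the squared coordinate RMSEs, so with two symmetric coordinates the vector RMSE is about $\sqrt{2}$ times the coordinate RMSE.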
Figure 3: RMSE of different estimators with randomly generated $\mu^*$ under four model settings: $c = 1$ or 5
and $p_w = 0.5$ or 0.75. The same captions as those in Figure 2 are adopted.

In order to examine the performance of the methods with different incidental parameters, we
generate $\mu^*$ randomly for each repetition. Figure 3 shows the RMSE of different estimators of $\beta^*$ under
four model settings: $c = 1$ or 5 and $p_w = 0.5$ or 0.75. Each plot in Figure 3 has a pattern similar
to that of the left panel of Figure 2. When $p_w$ is fixed, the RMSE of each nonoracle method increases as
$c$ increases. This indicates that a nonoracle estimator of $\beta^*$ becomes worse as the data are more
"contaminated". However, the penalized estimators are much more robust than OLS. On the other
hand, the RMSE of the penalized estimators and OLS is quite stable with respect to $p_w$. Note that
a penalized method can even outperform the oracle one when $c$ is small. Table 2 contains the RMSE
of O, OLS, PLS.Hard and PLS.Soft under eight settings with $c$ = 0.5, 1, 3 or 5 and $p_w$ = 0.5 or
0.75. For each $p_w$, as $c$ varies from 0.5 to 5, the RMSE of O stays almost constant around 0.11, the RMSE
of PLS.Hard and PLS.Soft grows only from 0.11 to 0.13, whereas the RMSE of OLS increases from 0.11
to 0.24.

RMSE($\hat\beta$)   O      OLS    H      S      H.P    S.P    LAD
Setting 1     .116   .115   .112   .110   .113   .117   .134
Setting 2     .115   .126   .118   .114   .119   .125   .137
Setting 3     .113   .172   .133   .134   .155   .141   .155
Setting 4     .117   .242   .124   .136   .151   .156   .156
Setting 5     .113   .120   .109   .107   .112   .117   .134
Setting 6     .116   .124   .116   .112   .123   .126   .141
Setting 7     .110   .175   .135   .130   .151   .141   .154
Setting 8     .114   .237   .125   .137   .157   .152   .159

Table 2: The RMSE of $\hat\beta$ for different methods under eight different settings. In Settings 1 to 4, $p_w = 0.5$
and $c$ = 0.5, 1, 3, 5; in Settings 5 to 8, $p_w = 0.75$ and $c$ = 0.5, 1, 3, 5. Note that $\mu^*$ varies with simulations.
For a penalized method, the minimal RMSE is reported.

5.2 Comparison of Data-Driven Methods

We now illustrate the performance of our procedures when the regularization parameter is chosen
by the data-driven approach introduced in Subsection 3.2. Since the two-step methods have RMSE
similar to that of the one-step methods, only the latter are considered with the data-driven $\lambda$ and denoted
as PLS.Hard.Prac (H.P) and PLS.Soft.Prac (S.P). Simulations are first run with the fixed incidental
parameters shown in Figure 1. For the interval of $\lambda$, we let = 5 for PLS.Hard.Prac, and
Xl = 0.5 for PLS.Soft.Prac to guarantee is small enough. In both methods, q is set to be 0.95 
and the size of the testing set is 1/5 of that of the updated "pure" subset, whose 0.7n. 

The RMSE of the data-driven penalized methods is plotted along with that of the other methods in the right
panel of Figure 2. It shows that the RMSEs of PLS.Hard.Prac and PLS.Soft.Prac are close to each
other. They are not as good as PLS.Hard and PLS.Soft with the best $\lambda$, but significantly better
than OLS. Table 1 shows that the RMSEs of the estimators of $\beta^*$ from PLS.Hard.Prac and PLS.Soft.Prac
are around 0.15, larger than the minimal RMSEs from PLS.Hard and PLS.Soft, which are around
0.12, but significantly smaller than the RMSE from OLS, which is around 0.21.

As before, it is of interest to evaluate the average performance of the data-driven methods with
respect to different incidental parameters. Figure 4 shows the RMSE of the methods under the four
model settings in Figure 3. The pattern of Figure 4 is similar to that of Figure 3. As expected,
the data-driven methods are not as good as PLS.Hard and PLS.Soft with the best tuning parameters.
However, they are significantly better than OLS when $c$ is large. Table 2 shows that, as $c$ increases
from 0.5 to 5, the RMSE of PLS.Hard.Prac and PLS.Soft.Prac increases from around 0.11 to 0.15,
faster than the minimal RMSE of PLS.Hard and PLS.Soft, which goes from around 0.11 to 0.12,
but much slower than the RMSE of OLS, which goes from around 0.11 to 0.23. Table 2 also contains
the RMSE of the least absolute deviation regression method (LAD) used in Fan et al. (2012b), here applied to all,
rather than part, of the sample points. LAD performs similarly to PLS.Hard.Prac and PLS.Soft.Prac
when the incidental parameters are large, but not as well when $c$ is small. So, the data-driven
penalized methods deliver promising finite-sample performance in terms of RMSE.

Figure 4: RMSE of H.P and S.P and other methods under the four model settings in Figure 3.

5.3 Confidence Intervals 

We next investigate the finite-sample performance of the confidence interval (3.6) for $\beta_j^*$
based on the penalized two-step estimator $\tilde\beta_j$ and provide a data-driven variant.

We first compare the finite-sample approximation of the weak convergence of the one-step estimator $\hat\beta_j$ and the two-step estimator $\tilde\beta_j$ through
QQ plots against their limit distribution A^(0, 1). Simulation settings are as follows: {(Xji,Xj2)} 
are i.i.d. from iV(0, IS^Is); n = 200; p2 = 0.01 and (pi, c) is (0.01, 1) or (0.05, 5); Vw = 0.75; r = 1. 
When c = 1 or 5, A is set to be 2 or 3, respectively. 

Figure 5 shows the QQ plots of the one-step estimator $\hat\beta_j$ and the two-step estimator $\tilde\beta_j$. The left panel
shows good normal approximations for both estimators, since $c$, $p_1$ and $p_2$ are so small that the nonzero
incidental parameters are negligible. A closer inspection of the left panel reveals that the QQ plot of
$\hat\beta_j$ is slightly better than that of $\tilde\beta_j$. This is because $\lambda = 2$ is so small that $m$, the size of $\hat I_0$,
is significantly less than the sample size $n$, and $\tilde\beta_j$ uses less informative data than $\hat\beta_j$. In the right
panel, with influential large incidental parameters, the QQ plot of $\hat\beta_j$ clearly deviates from the diagonal
line while that of $\tilde\beta_j$ almost coincides with it, which demonstrates the advantage of $\tilde\beta_j$ in constructing
confidence intervals.

Figure 5: QQ plots of standardized $\hat\beta_j$ and $\tilde\beta_j$ against their limit distribution $N(0, 1)$. The red line is
$y = x$. The left and right panels are with $(c, p_1, \lambda) = (1, 0.01, 2)$ and $(c, p_1, \lambda) = (5, 0.05, 3)$, respectively.

The previous comparison suggests adopting (3.6) as a robust confidence interval for $\beta_j^*$. After
replacing $n$ with $m$ and plugging in $\hat\sigma$ and $\hat\sigma_j$, the square root of the $(j,j)$ element of $\hat\Sigma_n^{-1}$, we
obtain the Wald-type confidence interval

$$[\tilde\beta_j \pm m^{-1/2}\hat\sigma\hat\sigma_j z_{\alpha/2}]. \qquad (5.1)$$
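As a rough illustration of how an interval of the form (5.1) and its empirical coverage rate might be computed, consider the sketch below. The function names are hypothetical, the inputs (the two-step estimate, the two plug-in scale estimates and the subset size $m$) are assumed to be available from earlier steps, and $z_{\alpha/2}$ is taken to be the upper $\alpha/2$ quantile of the standard normal distribution.

```python
import math
from statistics import NormalDist

def wald_interval(beta_j, sigma_hat, sigma_j_hat, m, alpha=0.05):
    """Interval beta_j +/- m^(-1/2) * sigma_hat * sigma_j_hat * z_(alpha/2)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 normal quantile
    half_width = z * sigma_hat * sigma_j_hat / math.sqrt(m)
    return beta_j - half_width, beta_j + half_width

def coverage_rate(intervals, truth):
    """Fraction of intervals (lo, hi) that contain the true value."""
    hits = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    return hits / len(intervals)
```

For instance, `wald_interval(1.0, 1.0, 1.0, 100)` has half-width of about 0.196; collecting one such interval per repetition and passing the list to `coverage_rate` gives a coverage estimate of the kind reported for (5.1).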

In order to make (5.1) adaptive to data, it remains to specify a data-driven $\lambda$. The choice of $\lambda$ in
Subsection 3.3 for minimizing RMSE is no longer suitable for constructing confidence intervals. We
propose to first implement the first four steps of the specification procedure in Subsection 3.3, then
compute the standard error of {gj'^^'^'^)} with the indexes corresponding to the "pure" subset, and 
finally let $\lambda$ be six times the standard error.

The simulation settings for the data-driven interval (5.1) with $\alpha = 0.05$ are the same as the
previous ones except that $c = 5$ and both probabilities $p_1$ and $p_2$ vary. Table 3 shows the coverage
rates of (5.1) for $\beta_j^*$; coverage rates of at least 93% are in boldface. The
coverage rates are close to the nominal level 95% when the proportions $p_1$ and $p_2$ are small. As
$p_1$ or $p_2$ increases, the coverage rates decrease. However, the coverage rates are more sensitive to
the change of $p_2$ than to that of $p_1$. This reflects, for example, Theorem 3.8, which requires
an additional condition on the size $s_2$ but, as long as $s_1 = o(n)$, none on $s_1$.

.03   .09
92.2   92.2   93.0   91.3   90.4   91.4   85.8

Table 3: Coverage rates (CR) of 95% confidence intervals (5.1) for $\beta_j^*$ with different values of model
parameters $p_1$ and $p_2$. The coverage rates greater than or equal to 93% are in boldface.

Figure 6: Discovery number, estimated false discovery number and estimated false discovery proportion as
functions of the threshold for populations CEU, JPTCHB and YRI. The x-axis is $-\log_{10}(t)$.

5.4 Real Data Analysis 

We now implement the penalized estimation with the $L_1$ penalty and data-driven regularization
parameter in the multiple testing procedure proposed by Fan et al. (2012b) for studying the
association between the expression level of gene CCT8, which is closely related to Down Syndrome
phenotypes, and thousands of SNPs. The data set consists of three populations: 60 Utah residents
(CEU), 45 Japanese and 45 Chinese (JPTCHB) and 60 Yoruba (YRI). More details on the data
set can be found in Fan et al. (2012b).

In the testing procedure by Fan et al. (2012b), a filtered least absolute deviation regression 
(LAD) is exploited to estimate the loading factors with 90% of the cases (SNPs) whose test statistics 
Population   $t$   $R(t)$   FDP($t$) with LAD   FDP($t$) with S.P



CEU 6.12 X 10-"^ 4 .810 .845 

JPTCHB 1.51 X IQ-^ 5 .153 .373 

YRI 2.54 X 10-9 2 .227 .308 

Table 4: Discovery numbers and estimated FDPs with LAD and S.P for specific thresholds. 

are small and thus the resulting estimator is statistically biased. We upgrade this step with S.P
described in Subsection 5.2 and re-estimate the number of false discoveries $V(t)$ and the false
discovery proportion FDP($t$) as functions of $-\log_{10}(t)$, where $t$ is a thresholding value. Figure 6
shows the number of total discoveries $R(t)$, $V(t)$ and FDP($t$) from the procedures using filtered LAD
and S.P. It is clear that $V(t)$ and FDP($t$) with S.P are uniformly larger than but reasonably close
to those with filtered LAD. Table 4 contains $R(t)$ and FDP($t$) with filtered LAD and S.P for several
specific thresholds. The estimated FDPs with S.P for CEU and YRI are slightly larger than those
with LAD, and the FDP for JPTCHB with S.P is more than double that with filtered LAD. This
suggests that the estimation of FDP with filtered LAD might tend to be optimistic.

6 Conclusion 

This paper considers the estimation of a structural parameter with a fixed or diverging dimension in
a linear regression model in the presence of high-dimensional sparse incidental parameters. To
exploit the sparsity, we propose a method that penalizes the incidental parameters. The penalized
estimator of the structural parameter is consistent, asymptotically Gaussian, and achieves
an oracle property. In contrast, the penalized estimator of the incidental parameters possesses
only partial selection consistency but not consistency. Thus, the structural parameter is consistently
estimated while the incidental parameters are not, which presents a partial consistency phenomenon.
Further, in order to construct better confidence regions for the structural parameter, we propose
a two-step estimator, which has fewer possible asymptotic distributions and can be asymptotically
even more efficient than the previous one-step estimator.

Simulation results show that the penalized methods with the best regularization parameters achieve
significantly smaller mean squared errors than the naive ordinary least squares method that ignores
the incidental parameters. Also provided is a data-driven regularization parameter, with which
the penalized estimators continue to significantly outperform the naive ordinary least squares when
incidental parameters are too large to be neglected. The advantage of the confidence intervals
based on the two-step estimator is verified by simulations. A data set on genome-wide association
is analyzed with a multiple testing procedure equipped with a data-driven penalized method and
false discovery proportions are estimated.

Although this paper only illustrates the partial consistency phenomenon of a penalized estimation
method for a linear regression model, such a phenomenon should exist universally for general
parametric models that contain both a structural parameter and a high-dimensional sparse incidental
parameter. Further, if the structural parameter has a dimension diverging faster than the
sample size and is sparse, it is expected that the partial consistency phenomenon will continue to
appear when a sparsity penalty is imposed on both the structural and incidental parameters.

A Appendix 

In this appendix, we only prove the theoretical results in Section 4 to save space. The proofs of the 
results in Sections 2 and 3 are provided in supplements. 

Denote §k,i = S{k,k+i,- ,1} ^nd S^^^ = §1^^^^;^ ... ;|. Let B = {maXs+i<i<n\\Xi\\2 < Hn} and 
^ = nr=i{-7n < < 7n}. Then P{B) ^ 1 and P{V) ^ 1 by (2.2). 

Proof of Lemma 4-.1. We first consider S'jo's, then S'ii's, and finally 5i2's with i = 1,2, 3. 

On Sio, S20 and Sso- Let A = {Sw = S^q}. Note that P{A) > P{A\B)P{B) and P{B) 1. 
It suffices to show that P{A\B) — )• 1. By noting A ^ ^fdKn, we have 

P{A\B)>P{{s + l<i<n: -X+ max ||Xi||2\/dC < < A - max \\X,\\2^/dC] SIq\B) 

s+l<j<n s+l<.i<.n 

>P{{s + 1 < i < n : -A + KnVdC < < A - K„\/dC} D S{q) > P{V) 1. 

Thus, wpgl, 5io = S*Q. From Sw U 52o U ^30 = S{q, it follows that, wpgl, ^20 = •S'so = 0. 

On S21, 5*31 and ^n. Recall that /i* = min{|/x*| : 1 < i < si} and note that A — /i* + \/dC'K„ < 
— 7„ when n is large. Let 52ii = 5'2iS'2i and S'212 = >S'2i5|p We will show P{S2ii = S21) — >• 1 and 
P(52i2 = 0) ^ 1. Then P(52i = S^J ^ 1. 

Denote ^1 = {52ii D 5'2i}- Oii the event B, 

S211 D {1 < i < si : > A - /i* + \fdCKn and /U* > 0} D {1 < i < si : ej > -7„ and ;U* > 0}. 

Then, P{Ax) > P{Ai\B)P{B) > P{{1 < i < si : > -7„ and /x* > 0} D S^^)P{B) ^1-1 = 1. It 
follows that, wpgl, 5211 D 5*21. Note that 5'2ii C ^21. Then, wpgl, 5*211 = 52i. 
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Denote A2 = {S212 = O}- On the event B, 

5*212 C {1 < i < si : ej > A + /i* - VdCnn and < 0} C {1 < i < si : ej > jn}- 

Then, P{A2) > PiA2\B)P{B) > P({1 < i < si : > 7„} = 0)P(i3) ^= 1. Then, wpgl, ^212 = 0- 
Thus, P{S2i = S21) — ?• 1. Shnilarly, we can show, wpgl, ^31 = 5*3;^. Note that 5ii, S21 and S31 

are disjoint and their union is S21 U S^i- Then, wpgl, 5ii = 0. 

On S12, S22 and 532. Denote A = {S12 = 5'^2}- Note that —A — fi* + VdCKn < —jn and 

A — /i* — VdCKn > jn when n is large for si + 1 < i < s. On the event B, 

S12 D {si + I < i < s : -X - fi* + VdCKn < ei < X - fJ^t - \^CKn} 
D {si + l<i<s : -7n <ei< 7„}. 

Then, P{A) > P{A\B)P{B) > P({si + 1 < i < s : -7„ < < 7„} = Sj^2)-P(^) ^ 1- Thus, 
wpgl, 5i2 = Si2- Note that 5i2, 5*22 and S'32 are disjoint and their union is 5*2- Then, wpgl, 

^22 = ^32 = 0. □ 

Before proceeding to the proofs of Theorems 4.2 to 4.7, some notation and assumptions are
needed. Let $\sigma_X^2 = (1/d)\sum_{j=1}^d \mathrm{Var}[X_{0j}]$ and $\sigma_{XX}^2 = (1/d^2)\sum_{k=1}^d\sum_{l=1}^d \mathrm{Var}[X_{0k}X_{0l}]$ and we assume

(C1) $\sigma_X^2$ is bounded.
(C2) $\sigma_{XX}^2$ is bounded.

Assumption (C) in Section 4 implies assumptions (C1) and (C2) by the Cauchy-Schwarz inequality.
For simplicity, we adopt the notation $\lesssim$, which means the left hand side is bounded by a constant
times the right, where the constant does not affect related analysis. Below are three lemmas needed 
for proving Theorems 4.2 to 4.7. Their proofs are in Supplement F. Suppose that M and E are 
matrices and ||-|| is a matrix norm and that is a sequence of random d x d matrices and A a 

deterministic d x d matrix, and denote = the sample covariance matrix. 

Lemma A.l (Stewart (1969)). = 1 and ||A^~^|| < 1, then 

\\{M + E)-^ - M-'^W ^ \\M-'^\\\\E\\ 

Lemma A. 2. // ||A^ Wp^d is bounded, An — > A, and > l/yd, then A~ — > A , where the 
convergence in probability is wrt rd\\-\\F- 
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Lemma A. 3. If assumption ( C2) holds and r'^d'^/n — )• 0, then 5]„ — > "Sx wrt rd\\-\\F- 

Proof of Theorem 4-2. By the proof of Lemma 4.1, wpgl, the solution /3„ to ipn{P) = on 13c{(3*) 
is exphcitly given by 

P^ = I3* + Tq-i (Ti +T2 + T3- T4), 

where To = (l/n)Ss,+i,„, Ti = (l/n)S^*_^, T2 = (l/n)S|^^i_„, T3 = {X/n)Ss*^ and r4 = {X/n)Ss*^. 

Then, r^WP^ — /3*||2 < ll^o""'"llF,(i X]i'=i '"dV^ll^db- We will show that ||Tg~"'"||^^(i is bounded by a 

positive constant wpgl and rd-v/d||7i||2 for z = 1,2,3,4. Then, raW^n ~ P*\\2 = c'p(l). 

On Tq. By Lemma A. 3, ||To — XlxH^.d — > under assumption (C2) and the condition 

d^/n —7- 0. Then, by Lemma A. 2, together with assumption (Bl), ||Tq""'^ — S^^'^H^f^ — ^ 0. This 

implies that, wpgl, ||rQ~^||i?^rf is bounded by a positive constant. 

On Ti. Wpgl, rd\/(i||ri||2 < rd^fds2Hnln/n = o(l) for S2 = o{n/{rdVdKnJn))- 

On T2. For any 6 > 0, > 5) < {l/5^)P\\{l/n)Zts^+i < dcj^a\/{n5^), where 

a\ = (l/(i)Ei=if^|- Thus, P{rd^\\T2\\2 > < rjd^f^^^l/inS^) ^ by assumption (CI) and 

{rddf/n ^ 0. 

On T3 and T4. Wpgl, ?'d\/d||T'3||2 < rdVdXsiKn/n = o(l) for si = o{n / {r d'/dXnn)) ■ Similarly, 
rd^\\Ti\\2 = op{l). □ 

Below is a lemma needed for proving Theorem 4.3 and its proof is in Supplement F. Suppose 
{^j} are i.i.d. copies of a d-dimensional random vector with mean zero. Denote c|jnax ~ 
maxi<j<d Var[^Oi], o-|,min = mini<j<d Var[^oi] and 7g,max = maxi<j<rf E|^oil^- 

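Lemma A.4. Suppose $\sigma_{\xi,\max}$ and $\gamma_{\xi,\max}$ are bounded from above and $\sigma_{\xi,\min}$ is bounded away from zero.
If $d = o(\sqrt{n})$, then

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \xi_i = O_P(\sqrt{d\log d}) \quad \text{wrt } \|\cdot\|_2.$$
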
Proof of Theorem 4-3. We reuse the notations Tj's in the proof of Theorems 4.2, from which, 

VnAniPn - P") = Vi + V2 + Vz- Va. 

where Vi = BnTi for i = 1,2,3,4 and = ^/nAnT^^. It is sufficient to show that V2 — — ?• 
A(0,a2Gx) and other Vi's are op(l). 

On Vi. We have ||Vi||2 < -v/n(i|| A„||p||rQ~"^||p_rf||ri||2. By assumption (D), ||A„||p is bounded. 
By Lemmas A. 2 and A. 3 and assumption (Bl), for d = o(n^/^), wpgl, IITq^^IIf,^ is bounded. We 
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have, wpgl, ||ri||2 < s-it^nlnln. Then, ||Vi||2 < Thus, ||Vi||2 = op(l) for S2 = 

oi^/n/iVdKn'jn))- 

On V2. We have V2 = V21 + V22, where V21 = y/nAnTr^T2 and V22 = V^AniT^^ - ^^)T2. 
First, consider V21. We have 

n 

V21 = \/{n- si)/n ^ Zn,', 



'.,11 

j = Sl + l 



where Zn,i = (1/^"^ - si)An'S^^Xiei. On one hand, for every 5 > 0, Y17=si+iM^n,i\\2{\\Zn,ih > 
6} <{n- si)E||Z„,o|lt/<5^ and 



Then, by assumptions (Bl'), (C) and (D) and for d = o(^), Y17=si+iMZn,iM{\\Zn,i\\2 > S} 
0. On the other hand, Y17=si+i Cov(Z„^j) = a'^An'S^^A^ — )• a'^Gx by assumption (D). Thus, by 
central hmit theorem (see Proposition 2.27 in van der Vaart (1998)), ^21 N{0,a^Gx)- Next, 
consider V22- We have 

||1^22||2 < \\AjF{dlog{d))'/^TQ' - ^^''\\F{dlog{d))-'/^V^T2\\2. 

By assumption (D), is 0(1); by Lemmas A. 2 and A. 3, ((ifog(d))^/^||rg"^ - is op(l) 

for dHog{d) = o{n); by Lemma A.4, (dlog(d))-i/2||^r2||2 = (dfog(d))-i/2||_i=S|^^^ Jj^ is Op(l) 
for d = o{^/ri). Then, V22 — > 0. Thus, by slutsky's lemma, V2 — > N{0,a'^Gx)- 
On V3 and V4. First consider V3. By noting that si = o{^/n / {X^fdKn)) , wpgl, 

IIF3II2 < ^\\An\\F\\T^^FA\T?,h < VdXsiKjV^ ^ 0. 

Thus, \\V3\\2 = op{l). In the same way, IIV4II2 = op{l). □ 

Proof of Theorem 4-4- By the definition of £, we have P{£) = T1T2T3, where Ti = P{C\^^i{\f^i + 
Xf{f3*-^)+e,\ > A}), T2 = P{nLs,+im+Xf{f3*-^)+e,\ < A}) andTg = PiOl^^^Al^I {(3* - 
(3) + €i\ < A}). We will show that each Tj converges to one. Then, P{£) — )• 1. Denote C = 
~ P*\\2 ^ 1}- Then P(C) — t- 1 since /3 is a consistent estimator of /3* wrt r'(j||-||2. 
On Ti. We have 1 — Ti < Tn + T12, where 

^11 = P{[jM + XjiP* - ^) + e,| < A}), T12 = P( U {I/.* + XjiP^ ~f}) + e,| < A}). 

iSS2i *S53]^ 
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It is sufficient to show that both Tn and T12 converge to zero. By Vdun ^ A ^ fi*, 
Tn<P{ U {ei <X-iJ* + \\XiW2-0-(3*\\2},C) + P{n 

<P{\J {ei<X-l^* + VdK„}) + P(C^) < siP{eo < -7n} + P(C") 0. 

Similarly, T12 — )• 0. Thus Ti — )• 1. 

On T2 and T3. By ajn < A and VdKn ^ A, 

T2 > P( fl {-A -f,t + (l/rd)K„ < ei < A - //^ - (l/rrf)K„},C) 

i=si 

s s 
>P{f] {-A -f^t + ^dKn <ei<X-^4- ^dKn],C) > P( fj {"Tn < < 7n},C) ^ 1. 
i=s\ i=s\ 

Then r2 1. Similarly, T3 1. □ 
Proo/ 0/ Theorem 4.5. We have V^A„eI/^(^ - /3*) = Pi + P2 + + 1^2, where Pi = V^A^Pi, 



P2 = ^AiA„P2, Pi = iXlXf^r^XlYj^iio + /o}, P2 = -(X\Xi,)-^X\Yi,{h / /o}, 
and Fi's are defined in the proof of Theorem 4.3. Since P(||Pi||2 = 0) > P{/o = -^0} — ^ 1, 
we have Pi = op(l). Similarly, P2 = op(l). By the proof of Theorem 4.3, V\ = op(l) and 
V2 — ^ A^(0,(T^Gx)- Therefore, the desired result follows by Slutsky's lemma. □ 

Proof of Lemma 4-6- Since the assumptions and conditions of Theorem 4.2 hold with r^ > ^/d, the 
penalized estimators P and P are consistent estimators of /3* wrt \/d|H|2 by Theorems 4.2 and F.4 
in Supplement F. Let A = {Iq = Iq}- Then A occurs wpgl by Theorem 4.4. 

We have = TA + a^A", where T = (n - .si)-'^\\Yi^ - Xj^PHj. It suffices to show that 



^2. Note that T = Eti^i, where Ti = (n - s^)-^ Zts.+ii^I iP'' " P)?^ ^2 



in 



sir'T^s^+i^, T3 = 2(n-.iriEr=s,+i^f(/3^-3K> t, = (n-^ir^EU+i^f, n = 

2(n - si)-i J2Ls,+i f^i^Iif^"' - P) and Tg = 2{n - si)-i /"^e*- I* is clear that A a^. 

Thus, it is sufficient to show other T,'s are op(l). 

On Ti. For every rj > 0, wpgl, \/d||/3* — P\\2 < Then, by assumption (CI), wpgl, 

|Ti|<^^ X; ll^flli(v^ll/3*-^l|2)^<2r?^^E||X^||i = 2r,24<^^. 

i=si+l 

On Ts. For every r/ > 0, wpgl, 

11" 1 
|r3|<2-= V ||Xfei||2Vd||/3^-3||2<4?7— =E||X;^eo||2 = 4c7r?^x<r/. 
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On For sa = 0(^/72), |r4| < (n - si)-^S27n ^ 0. 
On T5. For S2 = oiVdn/ijni^n)), 

11 11 

IT5I < 2— S27nK„\/d||/3* - M2 < 2r/^ saTn^n A 0. 

On Te. For S2 = o(n/7n), wpgl, \Te\ < 4^7nS2E|eo| ^0. □ 

Below is a lemma needed for proving Theorem 4.7. 

Lemma A.5 (Wihler (2009)). Suppose $A$ and $B$ are $m \times m$ symmetric positive-semidefinite
matrices. Then, for $p > 1$,

$$\|A^{1/p} - B^{1/p}\|_F^p \le m^{(p-1)/2}\|A - B\|_F.$$

Specifically, when $p = 2$,

$$\|A^{1/2} - B^{1/2}\|_F \le (m^{1/2}\|A - B\|_F)^{1/2}.$$
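The $p = 2$ case of Lemma A.5 is easy to check numerically in the simplest situation. The sketch below is only a sanity check, not part of the proof: the function name is hypothetical and it is restricted to diagonal positive-semidefinite matrices, for which the matrix square root is the entrywise square root of the diagonal.

```python
import math

def sqrt_bound_diagonal(a_diag, b_diag):
    """Both sides of ||A^(1/2) - B^(1/2)||_F <= (m^(1/2) ||A - B||_F)^(1/2)
    for diagonal PSD matrices A = diag(a_diag) and B = diag(b_diag)."""
    m = len(a_diag)
    lhs = math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                        for a, b in zip(a_diag, b_diag)))
    frob_diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(a_diag, b_diag)))
    rhs = math.sqrt(math.sqrt(m) * frob_diff)
    return lhs, rhs
```

With `a_diag = [4.0, 1.0]` and `b_diag = [1.0, 0.0]`, the left side is $\sqrt{2}$ and the right side is $(\sqrt{2}\cdot\sqrt{10})^{1/2}$, so the inequality holds with room to spare.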

Proof of Theorem ^.7. We only show the result on /3. since the result on /3 can be obtained in a 
similar way. We reuse the notations Tj's in the proof of Theorems 4.2, from which, 

V^G~]i'Ar,0^ -p*) = M + R, 

where M = ^G'^j^ Ar^iji^ - (3*) and R = y/^{G~x'^ - G~^l^)An{hn - P*)- % Theorem 4.3, 
M N{0,a'^Gx)- Then, it is sufficient to show that R wrt ||-||2. We have 

R = Ri + R2 + R3 ~ Rii 

where Ri = -B^Tj for i = 1,2,3,4 and -B„ = ^JniGxL ~ ^ Xn^-^^f^^^ ■ show each 

Ri converges to zero in probability, which finishes the proof. Before that, we first establish an 
inequality for ^G^n ~ ^Xn^F- Lemma A. 5, 

- Gxln\\P ^ (\/^l|Gx,n " G^X^n II ^)^''^ • 

Note that, by Lemma A. 3, — X1x||f for = o{n). Then, by Lemma A. 2, 

||Gx,n — Gx,n\\F < ll^nllpH^^^ — < || A„ |||. || il„ — Sx||f — > 0. 

Thus, by Lemma A. 2 again, 

II^X,n ~ ^^"'"nll-^ ^ II^X,n — Gx,n\\F ^ ll^n||F||S„ — 5]x||f- 
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By noting that q is a fixed integer, we have 

\\Gx]n - G-^fy < WAMVqW^u - ^xWf)'^' < \\An\\F{\\±n " ^x\\f)'^'. 

On Ri. We have 

WRih < V^Vd\\G~^f - G-fMAMT,-^pjT,\\2 

< Vn{d\\±n - ^x\\F)^^^AnfF\\To'\\F,d\\Tlh. 

By Lemmas A. 2 and A. 3, (i||5]n — = op(l) for = o{n). By assumption (D), ||A„||ir 

is bounded. By Lemmas A. 2 and A. 3 and assumption (Bl), for d = o{ii}^^), wpgl, ||Tq""'^||p,^ is 
bounded. We have, wpgl, ||ri||2 < S2Hnln/n. Then, ||i?i||2 < S2Hnlnl\/n. Thus, ||-Ri||2 = op(l) 
for S2 = o(^/(K„7n))- 
On i?2- We have 

II-R2II2 < \\Gx^!i — G-^^l^\\F\\An\\F\\TQ'^\\F\WnT2\\2 

< (d2iog(d)||s„-Sx||F)^/'||Aj||||ro-i||p,,(diog(d))-i/2||V^r2||2. 

By Lemmas A. 2 and A. 3, log((i)||5]„ — E^Hf is op(l) for d^(log((i))^ = o{n). By assumption (D), 
||A„||p is 0(1). By Lemmas A. 2 and A. 3 and assumption (Bl), for d = o{'n}/^), wpgl, ||TQ""'^||F,d 
is bounded. By Lemma A.4, {d\og{d))-'^/'^\\y/nT2\\2 = {d\og{d))-^/'^\\^^l^^^ j2 is Op(l) for 
d = o{^/n). Thus, R2 0. 

On i?3 and R^. First consider ^23. By noting that si = o{^/n / {\Kn)) , wpgl, 

||i23||2 < V^WGxf - G-fMA4F\\T,-'MT3\\2 

< Vn{d\\±n - ^x\\F)^^^An\\UTo^\\F,d\\T3h < AsiK„/^/n ^ 0. 

Thus, II-R3II2 = op(l). In the same way, II-R4II2 = op{l). □ 

B Supplementary Materials 

Supplementary Materials: Additional materials for Sections 1 to 4. (PDF) 
Supplementary Materials for the Paper:



Partial Consistency with Sparse Incidental Parameters 

by Jianqing Fan, Runlong Tang and Xiaofeng Shi 

C Supplement for Section 1 

In this supplement, we first show the method provided by Neyman and Scott (1948) does not work 
for model (1.1) and then explain which assumptions or conditions for the consistent results of the 
penalized methods in Zhao and Yu (2006), Fan and Peng (2004) and Fan and Lv (2011) are not 
valid for model (1.1). 

Although the modified equations of maximum likelihood proposed by Neyman and Scott
(1948) can handle "a number of important cases" with incidental parameters, they unfortunately
do not work for model (1.1). More specifically, consider the simplest case of model (1.1) with
$d = 1$:

$$Y_i = \mu_i^\star + X_i\beta^\star + \epsilon_i, \quad \text{for } i = 1, 2, \cdots, n,$$

where $\{\epsilon_i\}$ are i.i.d. copies of $N(0, \sigma^2)$. Using the notation of Neyman and Scott (1948), the
likelihood function for $(X_i, Y_i)$ is $p_i = p_i(\beta, \sigma, \mu_i \mid X_i, Y_i) = (\sqrt{2\pi}\sigma)^{-1}\exp\{-(2\sigma^2)^{-1}(Y_i - \mu_i - X_i\beta)^2\}$,
and the log-likelihood function is $\log p_i = -\log(\sqrt{2\pi}\sigma) - (2\sigma^2)^{-1}(Y_i - \mu_i - X_i\beta)^2$. Then, the score
functions are

$$\phi_{i1} = \frac{\partial \log p_i}{\partial \beta} = \frac{1}{\sigma^2}(Y_i - \mu_i - X_i\beta)X_i,$$

$$\phi_{i2} = \frac{\partial \log p_i}{\partial \sigma} = -\frac{1}{\sigma} + \frac{1}{\sigma^3}(Y_i - \mu_i - X_i\beta)^2,$$

$$\omega_i = \frac{\partial \log p_i}{\partial \mu_i} = \frac{1}{\sigma^2}(Y_i - \mu_i - X_i\beta).$$

From the equation $\omega_i = 0$, we have $\hat\mu_i = Y_i - X_i\beta$. Plugging this $\hat\mu_i$ into $\phi_{i1}$ and $\phi_{i2}$ (replacing
$\mu_i$ with $\hat\mu_i$), we obtain $\phi_{i1} = 0$ and $\phi_{i2} = -1/\sigma$. Then, $E_{i1} = E\phi_{i1} = 0$ and $E_{i2} = E\phi_{i2} = -1/\sigma$.

Thus, $E_{i1}$ and $E_{i2}$ depend only on the structural parameters ($\beta$ and $\sigma$). However, we then
have $\Phi_{i1} = \phi_{i1} - E_{i1} = 0$ and $\Phi_{i2} = \phi_{i2} - E_{i2} = 0$. This means $F_{n1} = F_{n2} = 0$, independent of
the structural parameters! Consequently, the estimation equations degenerate to two $0 = 0$ equations,
which means that the modified equations of maximum likelihood do not work for model
(1.1).
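
The degeneracy can be checked numerically. The following is a minimal sketch, assuming the Gaussian model with $d = 1$ described above; the function names `scores` and `profiled_scores` and the toy data are ours, introduced only for illustration. It evaluates the three score functions at the profiled value $\hat\mu_i = Y_i - X_i\beta$ and confirms that the result no longer depends on the data.

```python
def scores(y, x, mu, beta, sigma):
    """Score functions of log p_i for one observation (Gaussian errors, d = 1)."""
    r = y - mu - x * beta                       # residual Y_i - mu_i - X_i * beta
    phi_beta = r * x / sigma**2                 # d log p_i / d beta
    phi_sigma = -1.0 / sigma + r**2 / sigma**3  # d log p_i / d sigma
    omega = r / sigma**2                        # d log p_i / d mu_i
    return phi_beta, phi_sigma, omega


def profiled_scores(y, x, beta, sigma):
    """Scores after replacing mu_i by the root of omega_i = 0."""
    mu_hat = y - x * beta                       # hypothetical helper: profiled mu_i
    return scores(y, x, mu_hat, beta, sigma)


# Toy (Y_i, X_i) pairs; the values are arbitrary and purely illustrative.
data = [(1.3, 0.5), (-2.0, 1.7), (4.1, -0.9)]
results = [profiled_scores(y, x, beta=0.8, sigma=2.0) for y, x in data]
```

Whatever data are supplied, the profiled scores equal $(0, -1/\sigma, 0)$ up to rounding, so subtracting their expectations leaves nothing to solve for $\beta$ or $\sigma$.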
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Next, we explicitly explain which assumptions or conditions for the consistent results of the 
penalized methods in Zhao and Yu (2006), Fan and Peng (2004) and Fan and Lv (2011) are not 
valid for model (1.1). 

Zhao and Yu (2006) derive strong sign consistency for the lasso estimator. However, their
consistency results, Theorems 3 and 4, do not apply to model (1.1), since the above specific design matrix
X does not satisfy their regularity condition (6) on page 2546. More specifically, with model (1.1),

where $\Sigma_X$ is the covariance matrix of the covariates. This means that some of the eigenvalues of
$C_{11}^n$ go to 0 as $n \to \infty$. Then the regularity condition (6), which is

$$\alpha^T C_{11}^n \alpha \ge \text{a positive constant, for all } \alpha \in \mathbb{R}^{s+d} \text{ such that } \|\alpha\|_2 = 1,$$

does not hold any more. Thus the consistency results, Theorems 3 and 4 in Zhao and Yu (2006), are
not applicable to model (1.1).

Fan and Peng (2004) show the consistency, in the Euclidean metric, of a penalized likelihood
estimator when the dimension of the sparse parameter increases with the sample size in Theorem
1 on page 935. Under their framework, the log-likelihood function of the data point $V_i = (X_i, Y_i)$
for each $i$ from model (1.1), with random errors being i.i.d. copies of $N(0, \sigma^2)$, is given by

$$\log f_n(V_i, \mu_i, \beta) \propto -\frac{1}{2\sigma^2}(Y_i - \mu_i - X_i^T\beta)^2,$$

where $\propto$ means "proportional to". As we can see, the log-likelihood functions with different $i$'s
might be different, since the $\mu_i$'s might be different for different $i$'s. This violates the condition that all the
data points are i.i.d. from a structural density in assumption (E) on page 934.

This violation might not be essential, however, since we could consider the log-likelihood function
for all the data directly. That is, we consider

$$\sum_{i=1}^n \log f_n(V_i, \mu_i, \beta) \propto -\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \mu_i - X_i^T\beta)^2.$$

Then, the Fisher information matrix for $(\mu, \beta)$ is given by

$$I_{n+d}(\mu, \beta) = \begin{pmatrix} \sigma^{-2} I_n & 0 \\ 0 & n\sigma^{-2}\Sigma_X \end{pmatrix},$$

where $I_n$ is the $n \times n$ identity matrix. Then, the Fisher information for one data point is

$$\frac{1}{n} I_{n+d}(\mu, \beta) = \begin{pmatrix} n^{-1}\sigma^{-2} I_n & 0 \\ 0 & \sigma^{-2}\Sigma_X \end{pmatrix}.$$

It is clear that the minimal eigenvalue $\lambda_{\min}(I_{n+d}(\mu, \beta)/n) = n^{-1}\sigma^{-2} \to 0$ as $n \to \infty$. This violates
the condition that the minimal eigenvalue should be bounded away from 0 in assumption (F) on
page 934. Thus, the consistency result, Theorem 1 in Fan and Peng (2004), cannot be applied to
model (1.1).

Fan and Lv (2011) "consider the variable selection problem of nonpolynomial dimensionality
in the context of generalized linear models" by taking the penalized likelihood approach with
folded-concave penalties. Theorem 3 on page 5472 of Fan and Lv (2011) shows that there exists a
consistent estimator of the unknown parameters in the Euclidean metric under certain conditions.
In Condition 4 on page 5472, there is a condition on a minimal eigenvalue:

$$\min_{\delta \in \mathcal{N}_0} \lambda_{\min}[X_I^T \Sigma(X_I\delta) X_I] \ge cn,$$

where $X_I$ consists of the first $s + d$ columns of the design matrix $X$. With model (1.1), this
condition becomes

$$\lambda_{\min}[X_I^T X_I] \ge cn,$$

which is

$$\lambda_{\min}[(1/n)X_I^T X_I] = \lambda_{\min}[C_{11}^n] \ge c,$$

where $C_{11}^n$ is the matrix defined in Zhao and Yu (2006) and $c$ is a positive constant. Since the
minimal eigenvalue $\lambda_{\min}[C_{11}^n]$ converges to 0, the above condition does not hold. Thus, the consistency
result, Theorem 3 of Fan and Lv (2011), is not applicable to model (1.1).
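
To see the failure of the minimal-eigenvalue condition concretely, here is a minimal sketch under assumptions of our own: the design matrix of model (1.1) is taken to be $[I_n, X]$ with $d = 1$, and only one incidental column is kept, so that the matrix in question is the $2 \times 2$ matrix $(1/n)\,[[1, x_1], [x_1, \sum_i x_i^2]]$. The helper names and the deterministic covariates are hypothetical and serve only as an illustration.

```python
import math


def min_eig_sym2(a, b, c):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    rad = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    # det / lambda_max avoids the cancellation in mean - rad
    return (a * c - b * b) / (mean + rad)


def c11_min_eig(x):
    """Minimal eigenvalue of (1/n) * [e_1, x]^T [e_1, x] for covariates x (d = 1)."""
    n = len(x)
    a = 1.0 / n                           # (1/n) * e_1^T e_1
    b = x[0] / n                          # (1/n) * e_1^T x
    c = sum(xi * xi for xi in x) / n      # (1/n) * x^T x
    return min_eig_sym2(a, b, c)


def covariates(n):
    """Deterministic stand-in for the covariates (an illustrative assumption)."""
    return [math.sin(i + 1.0) for i in range(n)]


eigs = {n: c11_min_eig(covariates(n)) for n in (10, 100, 1000, 10000)}
```

In this toy setting the minimal eigenvalue is bounded above by the diagonal entry $1/n$, so it cannot stay above a fixed positive constant $c$ as $n$ grows.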

D Supplement for Section 2 

In this supplement, we provide the proofs of Lemmas 2.1 and 2.3 and Proposition 2.2. Before that,
two graphs, Figures 7 and 8, illustrate the incidental parameters and the step of updating
the responses in the iteration algorithm with $d = 1$.

Figure 7: An illustration of three types of $\mu_i^\star$'s, that is, large $\mu_1^\star$, bounded $\mu_2^\star$ and zero $\mu_3^\star$. The
negative half of the real line is folded at 0 under the positive half for convenience. For the penalized
least squares method with a soft penalty function and under the assumption of fixed $d$, the
specification of the regularization parameter $\lambda$ is that $\kappa_n \ll \lambda$, $\alpha\gamma_n \le \lambda$, and $\lambda \ll \min\{\mu^\star, \sqrt{n}\}$.

Figure 8: An illustration for the updating of responses with $d = 1$. The solid black line is a fitted
regression line. The dashed black lines are the corresponding shifted regression lines. The circle
and diamond points are the original data points. The circle and triangle points are the updated
data points. That is, the diamond points are drawn onto the shifted regression lines.

Proof of Lemma 2.1. By subdifferential calculus (see, for example, Theorem 3.27 in Jahn (2007)),
a necessary and sufficient condition for $(\hat\mu, \hat\beta)$ to be a minimizer of $L(\mu, \beta)$ is that zero is in the
subdifferential of $L$ at $(\hat\mu, \hat\beta)$, which means that, for each $i$,

$$\hat\beta = (X^T X)^{-1} X^T (Y - \hat\mu),$$

$$Y_i - \hat\mu_i - X_i^T\hat\beta = \lambda\,\mathrm{sgn}(\hat\mu_i), \quad \text{if } \hat\mu_i \ne 0,$$

$$|Y_i - X_i^T\hat\beta| \le \lambda, \quad \text{if } \hat\mu_i = 0.$$

Thus, the conclusion of Lemma 2.1 follows. □

Proof of Proposition 2.2. First, we show that, wpg 1, ||/3(^)||2 is bounded by 2VdCi + C2. For each 
k > 1, we have 

where Si = U^j^^SijiP^''-'^^) for i = 1, 2, 3 and Sij's are defined at the end of Section 2. Denote 
Ak-i as the event 

{Snif3^^-'^) = 0,5i2(/3('-^)) = 5^2,5i(/3('=-i)) = St^U St,; S2iP^''-'^) = S^^i, SsiP^'-'^) = S*,,}, 

where »S'* 's are defined at the beginning of Section 3. 
By Lemma 3.1, P{Ao) — >■ 1. Thus, wpgl, 

where Tq = S/n, Ti = %.Jn, = S,,+i,„/37n, Tg = S|^+i,„/n, r4(/3(°)) = Si,.,/3(°V^^ and = 
(§51, -S5*,)A/n. We will show that, wpgl, WT^'Tih < C2/4, ||ro-ir2||2 < 2VdCi, WT^^Tsh < 
C2/4, ||ro-ir4(/3(°))||2 < C2/4 and llTo-iTsIb < C2/4. Then, wpgl. 



||/3('^||2 < Y,\\To'T^\\2 < 2VdCi + C2. 

i=l 

On Tq ^Ti. For S2^n/n = o(l), wpgl, 

IIT0-IT1II2 < ||(-S)-1||f||-S^* II2 < 4||5]^i||^E||Xo||2-7n ^ 0. 



yq* \\2 — ^W-^X 



Thus, wpgl, ||r(7^ri||2 <C2/4. 

On T0-IT2. Wpgl, 



llTQ-^Talb < ||(-S)-1-S.,+i,„||f||/31|2 < 2||/d||irCi = iVdCi. 
n n 



On Tq-ITs. Wpgl, 

II— 1 —111 II''' e P 
\\Tq T^\\2 < 2||S^ l|F||-Ssi+l,nl|2 > 0. 

Thus, wpgl, WTQ^Tsh < C2/4. 

On T^^mp'-^''). For si/n = o(l), 

||To-ir4(/3(o))||2 < ^||(-S)-i-Si,.J|i.||/3(o)||2 < -2VdC2 A 0. 
n n si n 

Thus, wpgl, ||ro-ir4(/3(°))||2 < C2/4. 

On Tq-^Ts. For siX/n = 0(1), wpgl, 

llTo-^TsIb < 2||5]^i^^(||l555j|2 + ll^555J|2) A 0. 

Thus, wpgl, \\Tq^T5\\2<C2/4. 

Next, consider \\(32 — Pi\\2- Since /3*-"'^^ is bounded wpgl, by Lemma 3.1, ^1 occurs wpgl. Then, 

where T4(/3(^)) = (l/n)Si,,^/3(^). Thus, wpgl, 

/3(2)_/3(i)=s-%^^^(^(i)_/3(o)). 

It follows that, for si = o{n), wpgl, 

11/3(2) _^(i)||2 < ||S-%,,J|^||/3« -/3WII2 < {2Vdsi/n){AVdCi + 202)^0. 

Then, wpg1, $\beta^{(2)} = \beta^{(1)}$, which means that, wpg1, the iteration algorithm stops at the second
iteration.

Finally, for any $K \ge 1$, repeat the above arguments. Then, with at least probability $p_{n,K} =
P(\cap_{k=0}^{K} A_k)$, which increases to one by Lemma 3.1, we have

$$\|\beta^{(K+1)} - \beta^{(K)}\|_2 \le (2\sqrt{d}\,s_1/n)^K (4\sqrt{d}\,C_1 + 2C_2) = O((s_1/n)^K) \to 0,$$

and $\|\beta^{(k)}\|_2 \le 2\sqrt{d}\,C_1 + C_2$ for all $k \le K$. □

Proof of Lemma 2.3. First, we show that a solution of (2.4) and (2.5) satisfies the necessary and
sufficient condition in Lemma 2.1. Denote a solution of (2.4) and (2.5) as $(\hat\mu, \hat\beta)$. Then $\hat\beta =
(X^T X)^{-1} X^T (Y - \hat\mu)$, which is exactly the first condition in Lemma 2.1, and, for each $i =
1, 2, \cdots, n$, $(\hat\mu, \hat\beta)$ satisfies one of three cases: $|Y_i - X_i^T\hat\beta| \le \lambda$ and $\hat\mu_i = 0$; $Y_i - X_i^T\hat\beta > \lambda$
and $\hat\mu_i = Y_i - X_i^T\hat\beta - \lambda$; $Y_i - X_i^T\hat\beta < -\lambda$ and $\hat\mu_i = Y_i - X_i^T\hat\beta + \lambda$. If $(\hat\mu, \hat\beta)$ satisfies the first case,
it satisfies the third condition in Lemma 2.1. If $(\hat\mu, \hat\beta)$ satisfies the second case, then $\hat\mu_i > 0$ and
$Y_i - \hat\mu_i - X_i^T\hat\beta = \lambda = \lambda\,\mathrm{sgn}(\hat\mu_i)$, which means that the second case satisfies the second condition in
Lemma 2.1. Similarly, the third case also satisfies the second condition in Lemma 2.1. Thus $(\hat\mu, \hat\beta)$
satisfies the necessary and sufficient condition in Lemma 2.1.

In the other direction, suppose $(\hat\mu, \hat\beta)$ satisfies the necessary and sufficient condition in Lemma
2.1. Then, the first condition in Lemma 2.1 is exactly (2.4). For each $i$, $(\hat\mu, \hat\beta)$ satisfies one of three
cases: $\hat\mu_i = 0$ and $|Y_i - X_i^T\hat\beta| \le \lambda$; $\hat\mu_i > 0$ and $Y_i - \hat\mu_i - X_i^T\hat\beta = \lambda$; $\hat\mu_i < 0$ and $Y_i - \hat\mu_i - X_i^T\hat\beta = -\lambda$.
If $(\hat\mu, \hat\beta)$ satisfies the first case, it satisfies the first case in (2.5). If $(\hat\mu, \hat\beta)$ satisfies the second case,
then $\hat\mu_i = Y_i - X_i^T\hat\beta - \lambda$ and $Y_i - X_i^T\hat\beta > \lambda$, which means that $(\hat\mu, \hat\beta)$ satisfies the second case
of (2.5). Similarly, if $(\hat\mu, \hat\beta)$ satisfies the third case, then it satisfies the third case of (2.5). Thus,
$(\hat\mu, \hat\beta)$ satisfies (2.4) and (2.5). □
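
The equivalence above can be illustrated numerically. The following is a minimal sketch, assuming $d = 1$, an ordinary least squares starting value, and a fixed number of sweeps; these choices, the toy data and the function names are ours and are not taken from the paper. It alternates the least-squares step of the form (2.4) with the soft-thresholding step of the form (2.5), then measures how far the result is from the three conditions of Lemma 2.1.

```python
def soft_threshold(r, lam):
    """Soft-thresholding rule for mu_i given the residual r = Y_i - X_i * beta."""
    if r > lam:
        return r - lam
    if r < -lam:
        return r + lam
    return 0.0


def fit(x, y, lam, n_iter=200):
    """Alternate the beta step and the mu step (d = 1, illustrative sketch)."""
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx   # OLS start (assumption)
    mu = [soft_threshold(yi - xi * beta, lam) for xi, yi in zip(x, y)]
    for _ in range(n_iter):
        # beta = (X^T X)^{-1} X^T (Y - mu), with the responses updated by mu
        beta = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu)) / sxx
        # mu_i by soft thresholding of the current residuals
        mu = [soft_threshold(yi - xi * beta, lam) for xi, yi in zip(x, y)]
    return mu, beta


def kkt_violation(x, y, mu, beta, lam):
    """Largest violation of the three conditions in Lemma 2.1."""
    sxx = sum(xi * xi for xi in x)
    worst = abs(beta - sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu)) / sxx)
    for xi, yi, mi in zip(x, y, mu):
        r = yi - xi * beta
        if mi != 0.0:
            sign = 1.0 if mi > 0 else -1.0
            worst = max(worst, abs(r - mi - lam * sign))
        else:
            worst = max(worst, max(0.0, abs(r) - lam))
    return worst


# Toy data: five points near the line y = 2x and one point shifted far upwards.
x_obs = [-2.0, -1.0, 0.0, 1.0, 2.0, 1.5]
y_obs = [-3.9, -2.2, 0.05, 2.15, 3.9, 13.0]
lam = 1.0
mu_hat, beta_hat = fit(x_obs, y_obs, lam)
violation = kkt_violation(x_obs, y_obs, mu_hat, beta_hat, lam)
```

On this toy data only the shifted point receives a nonzero $\hat\mu_i$, and the fitted pair satisfies the conditions of Lemma 2.1 up to rounding error.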



E Supplement for Section 3 

In this supplement, we provide the proofs of the results in Section 3. Before that, we point out that 
those two different sufficient conditions in Theorem 3.2 come from the different analysis on the term 
Se* . Each of the two different sufficient conditions does not imply the other. Specifically, on one 
hand, suppose the absolute values of /x*'s are all equal for i = si + 1, S2 + 2, •• • ,s. Then, ||M2ll2^'^ ~ 
^(2+(5)/2|^^|2+5 Yli=si+i If^il'^^^ — S2\fJ's\'^^^ ■ Thus assumption (A) holds automatically since 
52—7-00. This means that assumption (A) holds at least when the absolute magnitudes of /i*'s 
are similar to each other. For this case, there stiff exists a consistent estimator even if n/(K„7„) <^ 
S2 <^ n. On the other hand, suppose fi* = 7„ and the other /i*'s are all equal to a constant 
c > 0. Then, \\^^*\\l+' = [7^ + (^2 - l)c2](2+^)/2 and EL.^+i 1/^^ 1'+' = ll"-' + («2 - l)c2+^. If 
■S2 ^ 7n ^ n/iKnln), the previous two terms are both asymptotically equivalent to 7^"'"'^. Thus 
assumption (A) fails but the other sufficient condition holds. 

Proof of Lemma 3.1. The proof is similar to that of Lemma 4.1 and is omitted. □

Proof of Theorems 3.2. By Lemma 3.1, wpgl, the solution /3„ to (pn[P) = on Bc{(3*) is explicitly 
given by 

^„ = + Tq-i (Ti + T2 + T3- T^), 
where Tq = (l/n)S,,+i,„, Ti = (l/n)S^.^, T2 = (l/n)S|^+i „, Tg = (A/n)5sj^ and T4 = {X/n)Ss*^. 
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p —I p 

We will show that To — > 'S^ > with the Frobenius norm and Ti — > with the Euclidean norm 
for i = 1,2,3,4. Thus, by Slutsky's lemma (see, for example. Lemma 2.8 on page 11 of van der 
Vaart (1998)), /3„ is a consistent estimator of P*. 

On Tq . By law of large number, Tq — > Sx > 0. Then, by continuous mapping theorem, 

^o"' ^ > 0. 

On $T_1$: Approach One. Suppose $s_2 = o(n/(\kappa_n\gamma_n))$. Then,

$$\|T_1\|_2 \le \frac{1}{n}\sum_{i=s_1+1}^{s} \|X_i\mu_i^\star\|_2 = \frac{1}{n}\sum_{i=s_1+1}^{s} \|X_i\|_2 \cdot |\mu_i^\star| \le s_2\kappa_n\gamma_n/n = o(1).$$

On Ti: Approach Two. Under assumption (A), it follows (Six Ei=si+i )~^^^S5j2 
N{0,ld)- In fact, assumption (A) implies the Lyapunov condition for sequence of random vectors 
(see, e.g. Proposition 2.27 on page 332 of (van der Vaart, 1998)). More specifically, recall the 
Lyapunov condition is that there exists some constant 6 > such that 

i=si+l j=si+l 

Then, by assumption (A), 



2+S 



i=si+l j=si+l j=si+l i=si+l 

where Amin > is the minimum eigenvalue of Sx- Then, 

j=si+l i=si+l 

= -( E f^f?^'\\^T\\FOp{l) < i(s27^)V20p(l) < 4=7nOp(l) = Op(l), 
j = Sl + l ^ 

where stands for the Euclidian and Frobenius norm, respectively. 
On T2. By law of large number, T2 = op{l). 
On and T^^. By noting A <^ y/n, 

1 „ ,ysT„ 1 ^ „ A 



ll^slb = ||A-55* II2 = A ||^=5s* II2 < —j=Op{l) = op(l). 

Thus T3 = op(l). In the same way, we can show that Ti^ = op(l) holds. □ 

Proof of Theorems 3.4 and 3.6. It is sufficient to provide the proof for the case where the sizes of 
index sets S21 = {1 < i < si : /x* > 0} and = {1 < i < si : /x* < 0} are both asymptotically 
si/2 and 6 = 2. 
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From the proof of Theorems 3.2, wpgl, 

Let r„ be a sequence going to infinity. Then, rn{Pn~ P*) — ""^(^1 + ^2 + ^3 — V4), where Vi = r„Ti, 
V2 = r„T2, V3 = r„T3, V4 = r„T4 and Tj's are defined in the proof of Theorem 3.2. Next we derive 
the asymptotic properties of Tq and T^'s, from which the desired results follow by Slutsky's lemma. 
On Tq. By the proof of Theorem 3.2, Tq^ — ^ Sj^^ 



On Vi: Approach One. If r„ = ^/n and S2 = o{\/n / {K.n'yn)) , then 
1 1 - ^ 1 1 

\\Tl\\2 = \\rn-%*\\2<rn- \\Xi\\2-\yL*\ <rn-S2linln = ^^S2Kn'yn = o{l). 

n 12 n ^ — ' n 

i=si+l 

Thus, if = y/n or r„ <C -y/n and S2 = o{y/n / (Knln)) , then Ti = op(l). 
On Vi: Approach Two. If r„ = y/n, then 



n '^12 n Z^r, '^12 Jn Dr. 12 



where = IIM2II2 = Cl2i=si+i f^i'^)^^'^ ■ There are three cases on Dn/\/n or D'^/n. If D'^/n — )• 0, 
then Ti — ^ 0. If D^/n — )■ 1, then Ti — ^ A^(0, Sx)- If D'^/n — )• 00, it means that r„ = -y/n is too 
fast. Let rn ~ n/Dn = \fn^J njD^ <^ ^Jn. Then T\ iV(0, Sx); 



On ^2- If = ^fn, then r2 ^ A^(0, cj^Ex). Thus, if r„ ^/n, T2 ^ 0; if r„ > 
T2 — > 00. 

On V3 and V4. First consider T3. Denote #(•) as the size function. If r„ = i/n, then 



T3 = Xr„-Ss* = X.hlIl—^=Ss* . 
n V n y#0^ 

Note that "^{821) = si/2. There are three cases on X^/si/{2n). If AA/si/(2n) ^ 0, then T3 0. 
Note that Xy/si/{2n) — > is equivalent to si = o{2n/X^). If X^Jsl/{2n) — ^ 1, then Ts 
iV(0, Six). Note that X^Jsl/{2n) ^ 1 is equivalent to si ~ 2n/X^. If A^Vl 

2n) —7- 00, it means 

~ is too large. Let ~ n/{X^J (si/2)) = -y/nv^/lAysI) < ^/n. With this rate r^, 
Ts ^ Af(0, Sx). Note that X^Jsl/2n — 00 is equivalent to si » 0{2n/X^). In the same way, T^^ 
can be analyzed and parallel results can be obtained. □ 

Proof of Theorem 3.7. The proof is similar to that of Theorem 4.4 and is omitted. □

E.1 Supplement for Subsection 3.1

Proof of Theorem 3.8. Denote Iq = {si + 1, si + 2, • • • , s = si + S2, s + 1, ■ ■ ■ , n}. Note that 
S2 = o{^yn / (Knjn)) ensures that is consistent by Theorem 3.2. By Theorem 3.7, P{Io = Iq} goes 
to 1. Then, 

^ = i?i + i?2 + ro-i(Ti + r2), 

where Ri = (XT X^J-^XT I^^J/q / lo} and R2 = -(Xj^X/J-^Xf^Fj^/o / Iq} and T.'s are 
defined in the proof of Theorem 3.2. The proof for the consistency is similar to that of Theorem 
3.2 and is omitted. Next we show the asymptotic normahty. We have, 

rn0 - P*) = rnRi + rnR2 + T^HVi + V2), 

where V^'s are defined in the proof of Theorem 3.6. Since P{y/nRi = 0) > P{Iq = /q} ~^ 1; we 
have y/nRi = op(l). Similarly, y/nR2 = op(l). From the analysis on V^'s in the proof of Theorem 

3.6, the asymptotic distributions follows by Slutsky's lemma. □ 

Proof of Lemma 3.9. When assumption (A) or S2 = o[n/ {unln)) holds, the penalized estimators /3 
and /3 are consistent estimators of /3* by Theorems 3.2 and 3.8. Denote C = {Iq = Iq}. By Theorem 

3.7, C occurs wpgl. Then, cr^ = TC + cP'C'^ where T = an||l^/o ~ ^/o/^lll and a„ = l/(n — s\). It 
is sufficient to show T cr^. We have T = ^^^^ T^, where Ti = Y.l=s^+i[^'i if^* " ^2 = 

(3) and Tg = 2a„ X^^^^^+i ^J'*(-i■ It is straightforward to show that T2 — > a and each other Tj — > 
under the condition S2 = oinj'^'^ and by noting that 3 ~^ ■ Then o" is a consistent estimator 
oia. □ 

E.2 Supplement for Subsection 3.2 

In this supplement, we consider a special case with exponentially tailed covariates and errors. We
begin with a lemma on the Orlicz norm with $\psi_1$. Suppose $\{Z_i\}_{i=1}^n$ is a sequence of random variables
and $\{\mathbf{Z}_i\}_{i=1}^n$ is a sequence of $d$-dimensional random vectors with $\mathbf{Z}_i = (Z_{i1}, Z_{i2}, \cdots, Z_{id})^T$. From
Lemma 8.3 on page 131 of Kosorok (2008), we have the following extension.

Lemma E.1. If for each $1 \le i \le n$ and $1 \le j \le d$,

$$P(|Z_i| > x) \le c\exp\Big\{-\frac{1}{2}\cdot\frac{x^2}{ax + b}\Big\} \quad\text{and}\quad P(|Z_{ij}| > x) \le c\exp\Big\{-\frac{1}{2}\cdot\frac{x^2}{ax + b}\Big\},$$

with $a, b \ge 0$ and $c > 0$, then

$$\Big\|\max_{1 \le i \le n} |Z_i|\Big\|_{\psi_1} \le K\{a(1 + c)\log(1 + n) + \sqrt{b(1 + c)}\sqrt{\log(1 + n)}\},$$

$$\Big\|\max_{1 \le i \le n} \|\mathbf{Z}_i\|_2\Big\|_{\psi_1} \le K\{a\sqrt{d}(1 + cd)\log(1 + n) + \sqrt{bd(1 + cd)}\sqrt{\log(1 + n)}\},$$

where $K$ is a universal constant which is independent of $a$, $b$, $c$, $\{Z_i\}$ and $\{\mathbf{Z}_i\}$.

Proof of Lemma E.1. The proof for random variables $\{Z_i\}$ is the same as the proof of Lemma 8.3
on page 131 of Kosorok (2008). For random vectors $\{\mathbf{Z}_i\}$,

$$P(\|\mathbf{Z}_i\|_2 > x) \le P\Big(\max_{1 \le j \le d} |Z_{ij}| > x/\sqrt{d}\Big) \le \sum_{j=1}^{d} P(|Z_{ij}| > x/\sqrt{d}) \le c'\exp\Big\{-\frac{1}{2}\cdot\frac{x^2}{a'x + b'}\Big\},$$

where $a' = a\sqrt{d}$, $b' = bd$ and $c' = cd$. Then, by the result on random variables, the desired result
on random vectors follows. □
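
The two elementary steps in this proof can be checked numerically. The sketch below is only an illustration; the constants, the random test vectors and the function names are assumptions of ours. It verifies that $\|z\|_2 \le \sqrt{d}\max_j |z_j|$, which gives the first inequality, and that summing $d$ copies of the scalar tail bound at $x/\sqrt{d}$ reproduces the same bound with $a' = a\sqrt{d}$, $b' = bd$ and $c' = cd$.

```python
import math
import random


def tail_bound(x, a, b, c):
    """Right-hand side c * exp{-(1/2) * x^2 / (a x + b)} of the tail condition."""
    return c * math.exp(-0.5 * x * x / (a * x + b))


def union_bound(x, a, b, c, d):
    """Sum over j = 1..d of the scalar bound evaluated at x / sqrt(d)."""
    return d * tail_bound(x / math.sqrt(d), a, b, c)


def rescaled_constants(a, b, c, d):
    """Constants (a', b', c') used in the proof of Lemma E.1."""
    return a * math.sqrt(d), b * d, c * d


def norm2(z):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in z))


# Illustrative constants and test vectors (arbitrary choices).
a, b, c, d = 0.5, 2.0, 2.0, 4
rng = random.Random(0)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(1000)]

# ||z||_2 <= sqrt(d) * max_j |z_j|, so ||z||_2 > x forces max_j |z_j| > x / sqrt(d).
norm_check = all(
    norm2(z) <= math.sqrt(d) * max(abs(v) for v in z) + 1e-12 for z in vectors
)
```

Comparing `union_bound(x, a, b, c, d)` with `tail_bound(x, *rescaled_constants(a, b, c, d))` at a few values of `x` shows that the two expressions agree, which is the last step in the display above.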

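Now, suppose, for every $x > 0$,

$$P(|\epsilon_i| > x) \le c_1\exp\Big\{-\frac{1}{2}\cdot\frac{x^2}{a_1x + b_1}\Big\} \quad\text{and}\quad P(|X_{ij}| > x) \le c_2\exp\Big\{-\frac{1}{2}\cdot\frac{x^2}{a_2x + b_2}\Big\}, \qquad \text{(E.1)}$$

with $a_i, b_i \ge 0$ and $c_i > 0$ for $i = 1, 2$. By Lemma E.1, it follows that

$$\Big\|\max_{1 \le i \le n} |\epsilon_i|\Big\|_{\psi_1} \le K\{a_1(1 + c_1)\log(1 + n) + \sqrt{b_1(1 + c_1)}\sqrt{\log(1 + n)}\},$$

$$\Big\|\max_{1 \le i \le n} \|X_i\|_2\Big\|_{\psi_1} \le K\{a_2\sqrt{d}(1 + c_2d)\log(1 + n) + \sqrt{b_2d(1 + c_2d)}\sqrt{\log(1 + n)}\}.$$

Thus, from the inequality (3.8), if $a_1 > 0$, let $\gamma_n \gg \log(n)$; otherwise, let $\gamma_n \gg \sqrt{\log(n)}$.
Similarly, if $a_2 > 0$, let $\kappa_n \gg \log(n)$; otherwise, let $\kappa_n \gg \sqrt{\log(n)}$. Then, such $\gamma_n$ and $\kappa_n$ satisfy
the condition (2.2). Suppose both $a_1$ and $a_2$ are positive, which means both the $\epsilon_i$'s and the $X_{ij}$'s have
exponential tails. As before, set $\kappa_n = \gamma_n = \log(n)\tau_n$. For this case, the regularization parameter
specification (3.1) becomes $\log(n)\tau_n \ll \lambda \ll \min\{\mu^\star, \sqrt{n}\}$.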

At the end of this supplement, we simply list explicit expressions of $\kappa_n$ under different assumptions
on the covariates for the case with a diverging number of covariates, which extend
the results in Section 3.2. The magnitude of $\kappa_n$ becomes larger than that for the case with $d$
fixed, while $\gamma_n$ stays the same. Specifically, if $X_0$ is bounded with $C_X > 0$, then $\kappa_n = \sqrt{d}\,C_X$.
If $X_0$ follows a Gaussian distribution $N(0, \Sigma_X)$, then $\kappa_n = \sqrt{2d\sigma_X^2[(3/2)\log(d) + \log(n)]}$. If the
Orlicz norm $\|X_{0j}\|_\psi$ exists for $1 \le j \le d$ and their average $(1/d)\sum_{j=1}^{d}\|X_{0j}\|_\psi$ is bounded, then
$\kappa_n \gg d\,\psi^{-1}(n)$; for instance, if $\psi = \psi_p$ with $p \ge 1$, then $\kappa_n \gg d(\log(n))^{1/p}$. Finally, if the data $\{X_i\}$
satisfy the right inequality of (E.1) with $a_2 > 0$, that is, each component of $X_i$ is sub-exponentially
tailed, then $\kappa_n \gg d^{3/2}\log(n)$. It is worthwhile to note that these expressions of $\kappa_n$ depend on a
factor involving the diverging number of covariates $d$, which will influence the specification of the
regularization parameter and the sufficient conditions of all the theoretical results in Section 4.

F Supplement for Section 4 

In this supplement, we provide the proofs of the lemmas in Section 4 and some related results.

We first extend Proposition 2.2 to the case with d — t- 00 and d <^ n. Before that, we list two 
simple lemmas for a diverging d. Suppose {^j} is a sequence of i.i.d. copies of ^q, a d-dimensional 
random vector with mean zero. Denote cj| = (l/d) Yrj=i Varf^oi]. 

Lemma F.l. Suppose fT| is hounded. If d/n = o{l), then 

1 " 
n ^-^ 

i=l 

Lemma F.2. Suppose fT| is bounded. If d/n = o{l), then 

1=1 

Suppose the specification of the regularization parameter is given by

$$d\kappa_n \ll \lambda, \quad \alpha\gamma_n \le \lambda, \quad\text{and}\quad \lambda \ll \mu^\star, \qquad \text{(F.1)}$$

where $\alpha$ is a constant greater than 2.

Proposition F.3. Suppose assumptions (B1) and (E) hold and the regularization parameter
satisfies (F.1). Suppose there exist constants $C_1$ and $C_2$ such that $\|\beta^\star\|_2 \le C_1\sqrt{d}$ and $\|\beta^{(0)}\|_2 \le C_2\sqrt{d}$
wpg1. If the regularization parameter satisfies (3.1), $s_1\lambda\kappa_n/(n\sqrt{d}) = o(1)$ and $s_2\kappa_n\gamma_n/(n\sqrt{d}) =
o(1)$, then, for every $K \ge 1$, with at least probability $p_{n,K}$ which increases to one as $n \to \infty$,
$\|\beta^{(K+1)} - \beta^{(K)}\|_2 \le O((\sqrt{d}\,s_1\kappa_n^2/n)^K d)$ and $\|\beta^{(k)}\|_2 \le (2C_1 + C_2)d$ for all $k \le K$. Specifically,
wpg1, the iterative algorithm stops at the second iteration.

Proof of Proposition F.3. Reuse the notations in the proof of Lemma 2.2. First, we show that, 
wpgl, ||/3(^)||2 < (2Ci + C2)d. For each k>l, 




Since the regularization parameter satisfies (F.l), it is easy to check that the conclusion of Lemma 
4.1 continues to hold, which implies P{Ao) — t- 1. 
Thus, wpgl, 



We will show that, wpgl. 



||To-^ri||2 < (C2/4)d, 
\\To^T2\\2 < 2Cid, 
WTo^nh < {C2/4)d, 
\\T,'Uf3^^^)\\2<iC2/4)d, 
IK^nh < (C2/4)d. 



Thus, wpgl. 



\\(3^''^\\2<^\\To-'m2<{2C, + C2)d. 

i=l 

On Tq^Ti. Under assumption (Bl), for S2Kn'jn/{nVd) = o(l), wpgl, 

llTo-^rills < ||(-S)~^||F||-Sg* II2 <2||5:^i||^,d^K„7,d^0. 
n n 12 nv" 

Thus, wpgl, \\TQ^n\\2<C2d/4. 
On Tq'T2. Wpgl, 

||ro-ir2||2<||(-s)-^-s.,+i,„||F||/3*||2 

n n 



< \\id\\FCiVd+\\{-s)-^-§i,sAFCiVd 

n n 

<Cid+||(-S)-i||i.||-Si,,J|^C7i^/d, 



and 



1 



= -^^11X^112 < — K^. 

1=1 



Thus, Under assumption (Bl), for siK^/n = o(l), wpgl, 

\\To^T2\\2 < Cid + 2\\l^^^\\F,d^^KlCiVd<2Cid. 

n 






On Tq ^T^. Under assumptions (Bl) and (E), for log{d)/n = o(l), wpgl, 



- , .. , , - , , . .. , Sl- 

n n \/n 



<^Vp?)2||s-||,,Op(l)^0. 



Thus, wpgl, llTo-iTalb <C2d/4. 

On Tq"^T4(/3^'^^). Under assumption (Bl), for sik'^/u, wpgl, 

< Vd2\\^],'\\F/-KlC2Vd A 0. 

n 

Thus, wpgl, ||ro-ir4(/3(o))||2 < C2d/4. 

On Tq^T^. Under assumption (Bl), for siKnX/{n\/d) = o(l), wpgl, 

\\T,-'n\\2 < ^/^||(^S)-^||F,d^(||55,^||2+ ll^sjjb) 

< Vd2WS7}\\pd^siKn < Cid/i. 
n 

Next, consider WP^-Pih- Since (3^'^^ < {2Ci + C2)d wpgl, the conclusion of Lemma 4.1 holds, 
which implies occurs wpgl. 
Then, 

/3(2) = T-'Ti + T^'T2 + T^^Ts + T^^T^ifS'^^^) + T^^n, 

where 

r4(/3«) = -Si,.,/3(^). 
n 

Thus, wpgl, 

/3(2)_/3(i)=s-%^^^(/3(i)_/3(o)). 
Thus, for d^/'^siKl/n = o(l), wpgl, 

\\p(^)-(3W\\2 < Vip,,||-Si,,J|^||/3(i) -/3W|b 
n n 

< 2\\^^^\\F,d^-Kl{2Ci + C2)d < d^/^iKl/n ^ 0. 
n 

Thus, wpg 1, /3(2) = which means that, wpgl, the iteration algorithm stops at the second 

iteration. 
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For any K > I, repeating the above arguments, with at least probability pn,K = P{C\k=o-^k): 
which increases to one, we have P^'^^ < (2Ci + C2)d for k < K and 

\\f3iK+l) _ p(K)y < {2\\^-^y,,Vd'-^Klf{2Ci + C2)d < (VdSiKlM^'d ^ 0. 

n 

This completes the proof. □ 

Next, we provide the proofs of Lemmas A.2, A.3 and A.4 in the appendix.

/ — P 

Proof of Lemma A. 2. Let E = An — A. Note that > l/yd. Then, r(i||^||ir — > implies 
P 

\\^\\F,d — > 0. Thus, wpgl, is bounded by a constant C > 0. By Lemma A.l, 

II „ iiF,d_ii - i-C\\Ey,, 

Therefore, 

r.llA. -A y<C ,_c\\E\\,,^ ^'- 
This completes the proof. □ 
Proof of Lemma A. 3. For any 5 > 0, we have 

k=l 1=1 1=1 

Thus, P(rrf||S„ - SxIIf >(^) < ^|^r2d^/(n(52) = o(l) by assumption (C2) and for r^d^/n ^ 0. 
Thus, S„ is a consistent estimator of wrt □ 

Proof of Lemma A. 4- Let = ^/d\ogd and Ci > \/2cT^^inax- Then 

^'(ii^E^jb > <x;ni^E-i > 

^ ^ V"- ^ <7j aj^/d 

1=1 j=i i=i J 

where fij is the standard deviation of ^Qj. By Berry and Esseen Theorem (see, for example, P375 
in Shiryaev (1995)), there exists a constant C2 > such that -P(||(l/v^) Z^iLi ^ilb > O-dCi) < 
Ti + 2T2 , where 



r, = j:P(|Ar(o,l)|>^), T, = Y^':^ 

J = l J V J V 
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By noting cP = o{n), 

Therefore, $\|(1/\sqrt{n})\sum_{i=1}^{n}\xi_i\|_2 = O_P(a_d)$. □
The next result concerns the consistency of the penalized two-step estimator $\tilde\beta$.



Theorem F.4 (Consistency on f3). Suppose the assumptions and conditions of Theorem 4-2 hold. 



Ifrd > 1/Vd, then ^ ^ (3* wrt 



Proof of Theorems F.4- By Theorem 4.2, f3 — > /3* wrt r(i||-||2. By Theorem 4.4, P{Io = /q} — ^ 1 
for r^ > 1/Vd, where Iq = {si + 1, si + 2, • • • , s = si + S2, s + 1, ■ ■ ■ , n}. Then, wpgl, 

(3* = Ri + R2 + T^^Ti + Tp-^Ta, 

where Ri = (xT X^J-^XT r^J/o / /q}, R2 = -{XlXj,r^XlYj,{io + h} and T^'s are 
defined in the proof of Theorem 4.2. Then, 

TdW - /3II2 < rd||i2i||2 + rrf||i?2||2 + ||To-ii.,drrf\/d||Ti||2 + WT^^WF^'^dMXPih. 

Since i-*(||iii||2,d = 0) > P{/o = -^0} — ^ li we have i?i = op(l). Similarly, i?2 = op(l). By the proof 
of Theorem 4.2, ||rQ~^||p^ is bounded and r(^\/d||7i||2 for i = 1, 2. Thus, 3 /3* wrt 7'rf||-||2 
and Trf > \j\fd. □ 

Finally, we provide some additional results on the asymptotic distributions of /9 and /3 with a 
different scaling. Specifically, the scaling in Section 4 is Next, we consider another natural 

scaling 

Theorem F. 5 (Asymptotic Distribution on /3) . Suppose assumptions (Bl'), (B2), (C), (D) and 
(E) hold. If d^ log d = o{n), si = o{^/n/ (Xdun)) and S2 = o{'\/n / {dun^yn)) , then 



nA„S^^(/3„ - (3*) ^ iV(0, a'G). 

Theorem F.6 (Asymptotic Distribution on /3). Suppose the assumptions and conditions of Theo- 
rem F.5 hold except the condition si = o{y/n/ (Xdun)) . Then 
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By Theorems F.5 and F.6, Wald-type confidence regions can be constructed. In order to validate 
these confidence regions with estimated a and Sx, we need Lemma 4.6 and the fohowing result. 

Theorem F.7 (Asymptotic Distributions on /3 and /3 with Suppose the assumptions and 

conditions of Theorem F.5 hold. If {\og{d))^ = o{n), then 

Similarly, suppose the assumptions and conditions of Theorem F.6 hold. //d^(log(d))^ = o{n), then 

V^An±l!\~P - /3^) A N{0,a^G). 

Remark 6. A comparison of the assumptions and conditions of Theorem F.7 with those of Theo- 
rems F.5 and F.6 reveals that a much stronger requirement on d is needed to ensure S„ is a good es- 
timator of Y^x- Precisely, the former require that d^(log(d))^ = o{n) and the latter d^ log{d) = o{n). 
This stronger requirement on d is a price paid for estimating Sjf. 

Remark 7. The condition on the dimension d in Theorems 4.3 and 4.5 is d^ log((i) = o(n), slightly 
weaker than the condition d^ log(d) = o(n) in Theorems F.5 and F.6. Accordingly, The condition 
on the dimension d in Theorem 4.7 is d^(log((i))^ = o(n), slightly weaker than the condition 
(i^(log((i))^ = o(n) in Theorem F.7. This means that the scaling y/nAn is slightly better than the 
scaling y/nAn^x terms of the condition on d. Further, the former scaling is more suitable for 
constructing confidence regions for some entries of /3*. 

At the end of this supplement, we provide the proofs of the above theorems. 

Proof of Theorems F.5. Reuse the notations Tj's in the proof of Theorems 4.2, from which, 

v^A„i:^/'(^„ - /3*) = Vi + V2 + V^- Vi, 

where Vi = B^Ti for i = 1,2,3,4 and B„ = y/nAnT;][^TQ^ . We will show V2 ^ N{0,a'^G) and 
other Vi's are op(l), from which the desired result follows by applying Slutsky's lemma. 

On Vi. We have ||Vi||2 < yn'^ll^nll-FllS^ ||F,d||^o~ ll-F.dll^ilb- By assumption (D), is 

1/2 

bounded. By assumption (B2), bounded. By Lemmas A. 2 and A. 3 and assumption 

(Bl), for d = o{n^/^), wpgl, ||Tq ^Wfa is bounded. Further, wpgl, ||Ti||2 < ^S2Kn7n. Then, 
11^1 II2 ^ ;^S2dK„7„,, where < means that the left side is bounded by a constant times the right 
side, as noted at the beginning of the appendix. Thus, ||Vi||2 = op(l) for S2 = o{\/ri / (dKn^n)) ■ 
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On V2. We have V2 = V21 + V22, where 

V21 = V^A^E^'/'Ta, V22 = V^An-sH^TQ^ - S^')T2. 
First, consider V21. We have V21 = \/{n — si) / nY^^=sx+i where 



^/n- Si 



On one hand, for every S > 0, Y17=si+i^\\^";i\\2{\\^n,i\\2 > S} < {n — si)E||Z„_o||2/<^ ^-iid 

(n — sij^ 

Thus, by assumptions (Bl'), (C) and (D), EiL^i+i lEll^n.illidl^n.ilb > 5} ^ for d = o(^). 
On the other hand, X]r=si+i Cov(Z„^i) = a^AnA^ — )■ cj^G. Thus, by central hmit theorem (see, 
for example. Proposition 2.27 in van der Vaart (1998)), V21 N{0,a^G). Next, consider ^22- 
We have 



||V^22||2 < \\AM\'S]l^\\F,dd{logid))'/^\\T^^ - 5]^i||i.((ilog(d))-i/2||^T2||2. 

1/2 

By assumption (D), ||A„||i? is 0(1); By assumption (B2), Wp^d is 0(1); by Lemmas A. 2 

and A.3, (i(log(d))^/2||2^-i _ Y:~^\\f is op(l) for (flog{d) = o{n); By Lemma A.4, together with 
assumption (E), {dlogid))~y^\\^T2h = {dlog{d)y^/^\\j^8l^+^j2 is Op(l) for d = o{^). 
Thus, 1^22 0. By slutsky's lemma, V2 N{0,a'^G). 

On V3 and V4. First consider V3. By noting that si = o{^/n / (XdKn)) , wpgl, IIV3II2 < 
d^/n\\An\\F\\'^x WfaW^o ||F,d||?3||2 < dAsiK„/^ 0. Thus, ||V'3||2 = op{l). In the same way, 
i^lb = op{l). This completes the proof. □ 



1 /2 ~ ~ ~ 

Proof of Theorem F.6. From the proof of Theorem F.4, we have \fnA^^ ^ [fi — (3*) = Ri+ R2 + 
Vi + V2, where iii = ^/nAn^^ Ri, R2 = \fnAn^-l^ R2, and RiS and V^'s are defined in the 
proofs of Theorems F.4 and F.5. Since P(||-Ri||2 = 0) > P{Iq = Iq} — )• 1, we have Ri = op(l). 
Similarly, R2 = op(l). By the proof of Theorem F.5, Vi = op{l) and V2 -^(0, cr^G). Thus, the 
asymptotic distribution of is Gaussian by Slutsky's lemma. □ 
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Proof of Theorem F. 7. We only show the result on (3. smce the result on (3 can be obtained in a 
similar way. We reuse the definitions of Tj's in the proof of Theorems 4.2, from which, 

where M = y/^An'E]['^ 0^ - P") and R = y/^Anii^^"^ - 'Sx^)i(3n - (3"). By Theorem F.5, 
M — ^ N{0, a'^G). Then, it is sufficient to show that R — ^ wrt ||-||2- We have 

R = Ri + R2 + R3- Ra, 

where Ri = -B„Tj for i = 1,2,3,4 and S„ = ^/nAn{^n — 'S}'x^)Tq'^. We will show each Ri 
converges to zero in probability, which finishes the proof. 

On Ri. By Lemma A.5, HS^^ - sJ/^Hi. < {d^/'^\\tn - T.xWfY''^- Then, 

\\Rih < V^WAM^]!^ - ^T\\F\\T^'\\F\\Tih 

< ^dWAnMW^n - ^x\\F,df'^\\T^^\\F,d\\Tlh- 

By assumption (D), HAnUi? is bounded. By Lemma A. 3, — SxIIf,^ = op(l) for d = o{n}/^). 
By Lemmas A. 2 and A. 3 and assumption (Bl), for d = o{v}/'^), wpgl, ||TQ""'^||F,d is bounded. 
We have, wpgl, ||ri||2 < ^S2Kn7n- Then, ||i?i||2 < ^S2(iK„7„. Thus, ||i?i||2 = op(l) for S2 = 

0{^/n/{dKnln))- 

On i?2. We have 

P2II2 < \\A^\\Fd{\og{d)fl^\\t]!^ - 5]^/'||p||ro-i||p,,(dlog(d))-i/2|| V^r2||2, 

and 

d(log(d))V2||Sy' - S^/'IIp < {d'>'Hog{d)\\±n - ^xyf". 

By assumption (D), \\An\\F is 0(1); by Lemma A.3, d^/^ log(d)||l]„-Sx||F = op(l) for (i^(log(d))2 = 
o(n); by Lemmas A.2 and A.3, d{\og{d) f'^\\T^^^ - T.~^\\f is op(l) for d^\og{d) = o(n); by Lemma 
A.4, (dlog(d))-i/2||^r2||2 = (dlog(d))-i/2||-i^S|^+, „||2 is Op(l) for d = o{^). Thus, R2 A 0. 
On i?3 and R4,. First consider R-^. By noting that si = o{y/n / {Xdun)) , wpgl, 

\m\2 < dV^\\An\\F{\\±T - ^T\\F,d)'^^\\To'\\F,d\\T3\\2 < dXsiKn/V^ ^ 0. 

Thus, $\|R_3\|_2 = o_P(1)$. In the same way, $\|R_4\|_2 = o_P(1)$. □